This is the Chapter 1 & 2 project for STAT 321. We explore the relationship between a song’s energy level and the year it was released, using data from https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023.
Simple linear regression is a statistical technique that models how changes in one variable (the independent, explanatory, or predictor variable) predict changes in another (the dependent, response, or outcome variable). Here, release year is the predictor and energy is the response.
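In symbols, the model can be written as below. The notation is standard textbook notation assumed here for illustration; it does not appear in the project itself.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2) \text{ independently},
\qquad i = 1, \dots, n
```

For this project, y_i is the energy of song i and x_i is its release year; \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon_i is the random error.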
First, we shall prepare the data:
spotify_data <- read.csv(file = "spotify-2023.csv",
                         header = TRUE,
                         sep = ",")
# Print variables
names(spotify_data)
## [1] "track_name" "artist.s._name" "artist_count"
## [4] "released_year" "released_month" "released_day"
## [7] "in_spotify_playlists" "in_spotify_charts" "streams"
## [10] "in_apple_playlists" "in_apple_charts" "in_deezer_playlists"
## [13] "in_deezer_charts" "in_shazam_charts" "bpm"
## [16] "key" "mode" "danceability_."
## [19] "valence_." "energy_." "acousticness_."
## [22] "instrumentalness_." "liveness_." "speechiness_."
# Check dimensions
dim(spotify_data)
## [1] 953 24
# Print first 10 observations
head(spotify_data, n = 10)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## 7 Ella Baila Sola Eslabon Armado, Peso Pluma 2
## 8 Columbia Quevedo 1
## 9 fukumean Gunna 1
## 10 La Bebe - Remix Peso Pluma, Yng Lvcas 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## 7 2023 3 16 3090
## 8 2023 7 7 714
## 9 2023 5 15 1096
## 10 2023 3 17 2953
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## 7 50 725980112 34 222
## 8 43 58149378 25 89
## 9 83 95217315 60 210
## 10 44 553634067 49 110
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## 7 43 13 418 148 F Minor
## 8 30 13 194 100 F Major
## 9 48 11 953 130 C# Minor
## 10 66 13 339 170 D Minor
## danceability_. valence_. energy_. acousticness_. instrumentalness_.
## 1 80 89 83 31 0
## 2 71 61 74 7 0
## 3 51 32 53 17 0
## 4 55 58 72 11 0
## 5 65 23 80 14 63
## 6 92 66 58 19 0
## 7 67 83 76 48 0
## 8 67 26 71 37 0
## 9 85 22 62 12 0
## 10 81 56 48 21 0
## liveness_. speechiness_.
## 1 8 4
## 2 10 4
## 3 31 6
## 4 11 15
## 5 11 6
## 6 8 24
## 7 8 3
## 8 11 4
## 9 28 9
## 10 8 33
# Subset data for years 2000 and onwards
subset_data <- spotify_data[spotify_data$released_year >= 2000, ]
# Check dimensions
dim(subset_data)
## [1] 904 24
# Print first 10 observations
head(subset_data, n = 10)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## 7 Ella Baila Sola Eslabon Armado, Peso Pluma 2
## 8 Columbia Quevedo 1
## 9 fukumean Gunna 1
## 10 La Bebe - Remix Peso Pluma, Yng Lvcas 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## 7 2023 3 16 3090
## 8 2023 7 7 714
## 9 2023 5 15 1096
## 10 2023 3 17 2953
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## 7 50 725980112 34 222
## 8 43 58149378 25 89
## 9 83 95217315 60 210
## 10 44 553634067 49 110
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## 7 43 13 418 148 F Minor
## 8 30 13 194 100 F Major
## 9 48 11 953 130 C# Minor
## 10 66 13 339 170 D Minor
## danceability_. valence_. energy_. acousticness_. instrumentalness_.
## 1 80 89 83 31 0
## 2 71 61 74 7 0
## 3 51 32 53 17 0
## 4 55 58 72 11 0
## 5 65 23 80 14 63
## 6 92 66 58 19 0
## 7 67 83 76 48 0
## 8 67 26 71 37 0
## 9 85 22 62 12 0
## 10 81 56 48 21 0
## liveness_. speechiness_.
## 1 8 4
## 2 10 4
## 3 31 6
## 4 11 15
## 5 11 6
## 6 8 24
## 7 8 3
## 8 11 4
## 9 28 9
## 10 8 33
Visualizing the relationship with a base R scatter plot:
# Scatter plot of energy against release year (subset_data was created above)
plot(formula = energy_. ~ released_year,
     data = subset_data,
     main = "Scatter Plot of Energy vs Released Year",
     xlab = "Released Year",
     ylab = "Energy",
     col = "blue",
     pch = 19)
Visualizing the same relationship with ggplot2:
# Install ggplot2 if it is not already installed, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
# Scatter plot with ggplot2 (subset_data was created above)
ggplot(subset_data, aes(x = released_year, y = energy_.)) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Scatter Plot of Energy vs Released Year (2000 and onwards)",
       x = "Released Year",
       y = "Energy") +
  scale_x_continuous(breaks = seq(min(subset_data$released_year),
                                  max(subset_data$released_year),
                                  by = 2)) +
  theme_minimal()
Fitting the simple linear regression model:
# Regress energy on release year (subset_data was created above)
model <- lm(energy_. ~ released_year, data = subset_data)
summary(model)
##
## Call:
## lm(formula = energy_. ~ released_year, data = subset_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.411 -10.977 1.519 12.528 32.554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135.32868 288.30808 0.469 0.639
## released_year -0.03506 0.14269 -0.246 0.806
##
## Residual standard error: 16.27 on 902 degrees of freedom
## Multiple R-squared: 6.691e-05, Adjusted R-squared: -0.001042
## F-statistic: 0.06036 on 1 and 902 DF, p-value: 0.806
Intercept (Constant): 135.32868
Interpretation: The estimated mean of the response (energy_.) when the predictor (released_year) is zero. This value has no practical meaning here: year 0 lies far outside the range of the data (2000 and onwards), so the intercept is an extrapolation, which is also reflected in its large standard error (288.31).
Released Year: -0.03506
Interpretation: The estimated change in mean energy_. for each one-year increase in released_year. The negative sign suggests a slight decrease in energy, about 0.035 units per year, as the release year increases. However, the p-value for this coefficient is large (0.806), so we do not have sufficient evidence to reject the null hypothesis that the true slope is zero.
Taken together, the coefficients, standard errors, t values, and p-values describe the linear relationship between energy_. and released_year. In this model, the very low R-squared (6.691e-05) and the large p-value for released_year indicate that release year explains almost none of the variability in energy and that the relationship is not statistically significant.
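The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. The sketch below hard-codes the rounded numbers printed in the summary output above, so the results agree only up to rounding; the variable names are illustrative and are not part of the original analysis.

```r
# Values copied (rounded) from the summary(model) output above
estimate  <- -0.03506   # slope for released_year
std_error <- 0.14269    # standard error of the slope
df_resid  <- 902        # residual degrees of freedom

# t statistic for the null hypothesis that the slope is zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

round(t_value, 3)
round(p_value, 3)
round(ci, 3)
```

The confidence interval contains zero, which is consistent with the large p-value. With a single predictor, the F-statistic in the summary is the square of the slope's t value.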
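Extracting the p-value for the slope:
# Row 2 of the coefficient table is released_year; column 4 is Pr(>|t|)
p_value <- summary(model)$coefficients[2, 4]
Model checking with diagnostic plots:
# Residuals vs fitted values: checks linearity
plot(model, which = 1)
# Normal Q-Q plot of the residuals: checks normality
plot(model, which = 2)
# Scale-location plot: checks constant variance
plot(model, which = 3)
# Cook's distance: identifies influential observations
plot(model, which = 4)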
Cook’s distance is a statistical measure used to identify influential data points in a regression analysis. It measures how much the fitted model changes when a particular observation is deleted: a large Cook’s distance means that removing that observation noticeably changes the regression coefficients. In the Cook’s distance plot, the most influential observations are 359, 631, and 912, with 912 being the most influential.
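A minimal sketch of how Cook's distance can be computed and screened, using simulated data with one planted outlier rather than the Spotify data. The 4/n cutoff is a common rule of thumb assumed here for illustration; it is not a threshold taken from the original analysis.

```r
set.seed(321)

# Simulated data (hypothetical): a clean linear trend plus noise
n <- 30
dat <- data.frame(x = 1:n)
dat$y <- 2 + 0.5 * dat$x + rnorm(n)

# Plant one unusual observation at the edge of the x range
dat$y[n] <- dat$y[n] + 15

fit <- lm(y ~ x, data = dat)

# Cook's distance for every observation
cooks <- cooks.distance(fit)

# Observations above the 4/n rule of thumb
flagged <- unname(which(cooks > 4 / n))

# The single most influential observation
most_influential <- unname(which.max(cooks))

# Refit without that observation and compare the coefficients
fit_without <- lm(y ~ x, data = dat[-most_influential, ])
coef_change <- coef(fit) - coef(fit_without)
coef_change
```

Comparing the two fits mirrors the definition of Cook's distance: the planted point has the largest distance, and deleting it shifts both the intercept and the slope.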
R-squared (the coefficient of determination) is the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model. It ranges from 0 to 1, and a higher value indicates a better fit. R-squared does not indicate causation between the variables, nor does it show whether the model is appropriate for making predictions, so it should be used together with other model evaluation techniques.
R-squared in this case is 6.691e-05, meaning that released_year explains less than 0.01% of the variance in energy_. among songs released from 2000 onwards.
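The definition of R-squared can be checked numerically. The sketch below computes it by hand as 1 - SSE/SST on simulated data (hypothetical, not the Spotify data) and compares the result with the value reported by summary(). It also recovers the R-squared of the Spotify model from the printed F-statistic, using the identity R-squared = F / (F + residual df), which holds for a model with one predictor.

```r
set.seed(321)

# Simulated data (hypothetical) with a weak linear signal
x <- runif(100, min = 2000, max = 2023)
y <- 60 + 0.1 * (x - 2000) + rnorm(100, sd = 15)
fit <- lm(y ~ x)

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
sse <- sum(residuals(fit)^2)
sst <- sum((y - mean(y))^2)
r2_manual <- 1 - sse / sst

# Value reported by summary(); should match r2_manual
r2_summary <- summary(fit)$r.squared

# In simple linear regression, R-squared equals the squared correlation
r2_cor <- cor(x, y)^2

# Recover the Spotify model's R-squared from its printed F-statistic
f_stat   <- 0.06036
df_resid <- 902
r2_from_f <- f_stat / (f_stat + df_resid)

c(manual = r2_manual, summary = r2_summary, cor = r2_cor, from_f = r2_from_f)
```

The first three values agree with each other, and the last reproduces the reported 6.691e-05 up to rounding of the F-statistic.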
Finally, let’s make predictions based on our model for new data points.
# New data with release years
new_data <- data.frame(released_year = c(2005, 2010, 2015))
# Predict energy for new data
predictions <- predict(model, newdata = new_data)
# Display predictions
predictions
## 1 2 3
## 65.04214 64.86686 64.69158
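For a linear model, predict() evaluates the fitted line at the new predictor values. The sketch below shows this by hand on simulated data (hypothetical, standing in for the Spotify subset; the column name energy is an assumption for illustration) and also requests 95% prediction intervals, which convey how uncertain an individual prediction is.

```r
set.seed(321)

# Simulated data (hypothetical) standing in for the Spotify subset
sim <- data.frame(released_year = sample(2000:2023, 200, replace = TRUE))
sim$energy <- 65 - 0.03 * (sim$released_year - 2000) + rnorm(200, sd = 16)

fit <- lm(energy ~ released_year, data = sim)

# newdata only needs the predictor column(s) used in the model formula
new_years <- data.frame(released_year = c(2005, 2010, 2015))

# Predictions from predict()
pred <- predict(fit, newdata = new_years)

# The same predictions by hand: intercept + slope * released_year
b <- coef(fit)
pred_manual <- b[["(Intercept)"]] + b[["released_year"]] * new_years$released_year

# Point predictions with 95% prediction intervals (columns fit, lwr, upr)
pred_int <- predict(fit, newdata = new_years,
                    interval = "prediction", level = 0.95)
pred_int
```

In the Spotify model, the residual standard error (16.27) is far larger than the differences among the three predictions (about 0.18 per five years), so the predicted energies are nearly indistinguishable in practice.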
In conclusion, we have gone through the entire process of simple linear regression analysis using real data: data preparation, visualization, model fitting, interpretation of coefficients and p-values, model checking, R-squared evaluation, and prediction. The diagnostic plots used in model checking assess the adequacy of the model by examining linearity, normality, and constant variance, and by identifying influential points. For songs released from 2000 onwards, the fitted model gives no evidence of a linear relationship between energy and release year (slope -0.03506, p-value 0.806, R-squared 6.691e-05).