This is the Chapter 1 & 2 project for STAT 321. We’ll explore the relationship between a song’s energy level and the year it was released, using data from https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023.

Simple linear regression is a statistical technique that helps us understand how changes in one variable (called independent/explanatory/predictor variable) can predict changes in another (called dependent/response/outcome variable).
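
As a rough sketch of what this means formally, the model can be written as below. The symbols are notation introduced here for illustration and do not appear in the original analysis: y_i stands for the response (energy_. in this project), x_i for the predictor (released_year), and the error terms are assumed to be independent and normally distributed.

```latex
\begin{align}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\end{align}
```

The second line gives the usual least-squares estimates of the slope and intercept, which is what lm() computes in R.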

First, we shall prepare the data:

spotify_data <- read.csv(file = "spotify-2023.csv", 
                        header = TRUE, 
                        sep = ",")

# Print variables
names(spotify_data)
##  [1] "track_name"           "artist.s._name"       "artist_count"        
##  [4] "released_year"        "released_month"       "released_day"        
##  [7] "in_spotify_playlists" "in_spotify_charts"    "streams"             
## [10] "in_apple_playlists"   "in_apple_charts"      "in_deezer_playlists" 
## [13] "in_deezer_charts"     "in_shazam_charts"     "bpm"                 
## [16] "key"                  "mode"                 "danceability_."      
## [19] "valence_."            "energy_."             "acousticness_."      
## [22] "instrumentalness_."   "liveness_."           "speechiness_."

Check dimension

dim(spotify_data)
## [1] 953  24

Print first 10 observations

head(spotify_data,n=10)
##                             track_name             artist.s._name artist_count
## 1  Seven (feat. Latto) (Explicit Ver.)           Latto, Jung Kook            2
## 2                                 LALA                Myke Towers            1
## 3                              vampire             Olivia Rodrigo            1
## 4                         Cruel Summer               Taylor Swift            1
## 5                       WHERE SHE GOES                  Bad Bunny            1
## 6                             Sprinter          Dave, Central Cee            2
## 7                      Ella Baila Sola Eslabon Armado, Peso Pluma            2
## 8                             Columbia                    Quevedo            1
## 9                             fukumean                      Gunna            1
## 10                     La Bebe - Remix      Peso Pluma, Yng Lvcas            2
##    released_year released_month released_day in_spotify_playlists
## 1           2023              7           14                  553
## 2           2023              3           23                 1474
## 3           2023              6           30                 1397
## 4           2019              8           23                 7858
## 5           2023              5           18                 3133
## 6           2023              6            1                 2186
## 7           2023              3           16                 3090
## 8           2023              7            7                  714
## 9           2023              5           15                 1096
## 10          2023              3           17                 2953
##    in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1                147 141381703                 43             263
## 2                 48 133716286                 48             126
## 3                113 140003974                 94             207
## 4                100 800840817                116             207
## 5                 50 303236322                 84             133
## 6                 91 183706234                 67             213
## 7                 50 725980112                 34             222
## 8                 43  58149378                 25              89
## 9                 83  95217315                 60             210
## 10                44 553634067                 49             110
##    in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                   45               10              826 125   B Major
## 2                   58               14              382  92  C# Major
## 3                   91               14              949 138   F Major
## 4                  125               12              548 170   A Major
## 5                   87               15              425 144   A Minor
## 6                   88               17              946 141  C# Major
## 7                   43               13              418 148   F Minor
## 8                   30               13              194 100   F Major
## 9                   48               11              953 130  C# Minor
## 10                  66               13              339 170   D Minor
##    danceability_. valence_. energy_. acousticness_. instrumentalness_.
## 1              80        89       83             31                  0
## 2              71        61       74              7                  0
## 3              51        32       53             17                  0
## 4              55        58       72             11                  0
## 5              65        23       80             14                 63
## 6              92        66       58             19                  0
## 7              67        83       76             48                  0
## 8              67        26       71             37                  0
## 9              85        22       62             12                  0
## 10             81        56       48             21                  0
##    liveness_. speechiness_.
## 1           8             4
## 2          10             4
## 3          31             6
## 4          11            15
## 5          11             6
## 6           8            24
## 7           8             3
## 8          11             4
## 9          28             9
## 10          8            33
  1. Next, we trim the data set to songs released in 2000 or later, since those are the only years we are interested in, and check the dimensions again:

subset_data <- spotify_data[spotify_data$released_year >= 2000, ]

Check dimension

dim(subset_data)
## [1] 904  24
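
One thing worth checking when filtering with a logical condition is how missing values are handled. The sketch below uses a small made-up data frame (toy is hypothetical and is not the Spotify data) to show that bracket indexing keeps an all-NA row when the condition is NA, while subset() drops it. Whether the Spotify data has any missing release years is not stated here, so treat this as a sanity check rather than a known problem.

```r
# Hypothetical toy data standing in for spotify_data
toy <- data.frame(
  track_name    = c("a", "b", "c", "d"),
  released_year = c(1995, 2000, 2012, NA)
)

# Bracket indexing: an NA condition produces a row filled with NA
by_index <- toy[toy$released_year >= 2000, ]

# subset(): rows where the condition is NA are dropped
by_subset <- subset(toy, released_year >= 2000)

nrow(by_index)                  # 3 (includes one all-NA row)
nrow(by_subset)                 # 2
range(by_subset$released_year)  # 2000 2012
```

Running range() on the filtered column, as in the last line, is a quick way to confirm the filter did what was intended.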

Print first 10 observations

head(subset_data,n=10)
##                             track_name             artist.s._name artist_count
## 1  Seven (feat. Latto) (Explicit Ver.)           Latto, Jung Kook            2
## 2                                 LALA                Myke Towers            1
## 3                              vampire             Olivia Rodrigo            1
## 4                         Cruel Summer               Taylor Swift            1
## 5                       WHERE SHE GOES                  Bad Bunny            1
## 6                             Sprinter          Dave, Central Cee            2
## 7                      Ella Baila Sola Eslabon Armado, Peso Pluma            2
## 8                             Columbia                    Quevedo            1
## 9                             fukumean                      Gunna            1
## 10                     La Bebe - Remix      Peso Pluma, Yng Lvcas            2
##    released_year released_month released_day in_spotify_playlists
## 1           2023              7           14                  553
## 2           2023              3           23                 1474
## 3           2023              6           30                 1397
## 4           2019              8           23                 7858
## 5           2023              5           18                 3133
## 6           2023              6            1                 2186
## 7           2023              3           16                 3090
## 8           2023              7            7                  714
## 9           2023              5           15                 1096
## 10          2023              3           17                 2953
##    in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1                147 141381703                 43             263
## 2                 48 133716286                 48             126
## 3                113 140003974                 94             207
## 4                100 800840817                116             207
## 5                 50 303236322                 84             133
## 6                 91 183706234                 67             213
## 7                 50 725980112                 34             222
## 8                 43  58149378                 25              89
## 9                 83  95217315                 60             210
## 10                44 553634067                 49             110
##    in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                   45               10              826 125   B Major
## 2                   58               14              382  92  C# Major
## 3                   91               14              949 138   F Major
## 4                  125               12              548 170   A Major
## 5                   87               15              425 144   A Minor
## 6                   88               17              946 141  C# Major
## 7                   43               13              418 148   F Minor
## 8                   30               13              194 100   F Major
## 9                   48               11              953 130  C# Minor
## 10                  66               13              339 170   D Minor
##    danceability_. valence_. energy_. acousticness_. instrumentalness_.
## 1              80        89       83             31                  0
## 2              71        61       74              7                  0
## 3              51        32       53             17                  0
## 4              55        58       72             11                  0
## 5              65        23       80             14                 63
## 6              92        66       58             19                  0
## 7              67        83       76             48                  0
## 8              67        26       71             37                  0
## 9              85        22       62             12                  0
## 10             81        56       48             21                  0
##    liveness_. speechiness_.
## 1           8             4
## 2          10             4
## 3          31             6
## 4          11            15
## 5          11             6
## 6           8            24
## 7           8             3
## 8          11             4
## 9          28             9
## 10          8            33
  2. Next, we visualize two variables, energy and release year (2000 onwards), to see the relationship between them:
# Subset data for years 2000 and onwards
subset_data <- spotify_data[spotify_data$released_year >= 2000, ]

# Scatter plot
plot(formula = energy_. ~ released_year, 
     data = subset_data, 
     main = "Scatter Plot of Energy vs Released Year",
     xlab = "Released Year",
     ylab = "Energy",
     col = "blue",
     pch = 19
)

Visualizing using ggplot:

# Install and load ggplot2 if not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Subset data for years 2000 and onwards
subset_data <- spotify_data[spotify_data$released_year >= 2000, ]

# Scatter plot with ggplot2
ggplot(subset_data, aes(x = released_year, y = energy_.)) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Scatter Plot of Energy vs Released Year (2000 and onwards)",
       x = "Released Year",
       y = "Energy") +
  scale_x_continuous(breaks = seq(min(subset_data$released_year), max(subset_data$released_year), by = 2)) +
  theme_minimal()

Fitting the simple linear regression model:

subset_data <- spotify_data[spotify_data$released_year >= 2000, ]
model <- lm(energy_. ~ released_year, data = subset_data)
summary(model)
## 
## Call:
## lm(formula = energy_. ~ released_year, data = subset_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.411 -10.977   1.519  12.528  32.554 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   135.32868  288.30808   0.469    0.639
## released_year  -0.03506    0.14269  -0.246    0.806
## 
## Residual standard error: 16.27 on 902 degrees of freedom
## Multiple R-squared:  6.691e-05,  Adjusted R-squared:  -0.001042 
## F-statistic: 0.06036 on 1 and 902 DF,  p-value: 0.806

Intercept (Constant): 135.32868

Interpretation: The estimated mean of the response variable (energy_.) when the predictor (released_year) is zero. This has no practical meaning here: the data used to fit the model only cover release years from 2000 onwards, so year 0 lies far outside the observed range.
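
One common way to get a more interpretable intercept is to center the predictor. The sketch below uses simulated data (year and energy are made-up vectors, not the Spotify data) and shifts the year so that the intercept refers to the year 2000 instead of year 0. The choice of 2000 as the reference year is an assumption made for illustration.

```r
# Simulated (hypothetical) data, not the Spotify data set
set.seed(1)
year   <- rep(2000:2023, each = 5)
energy <- 65 + rnorm(length(year), sd = 15)

# Original parameterization: intercept is the fitted mean at year 0
fit_raw <- lm(energy ~ year)

# Centered parameterization: intercept is the fitted mean at year 2000
fit_centered <- lm(energy ~ I(year - 2000))

coef(fit_raw)
coef(fit_centered)  # same slope, different intercept
```

Centering changes only the intercept. The slope, its standard error, and its p-value stay the same, so the conclusions about the relationship are unaffected.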

Released Year: -0.03506

Interpretation: The estimated change in mean energy_. for each one-year increase in released_year. The negative coefficient suggests a slight decrease in energy of about 0.035 units per year. However, the p-value for this coefficient is high (0.806), so we do not have sufficient evidence to reject the null hypothesis that the true coefficient is zero.

Together with the very low R-squared (6.691e-05), this indicates that released_year explains almost none of the variability in energy_. and that the linear relationship between the two variables is not statistically significant.
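
To see where the t value and p-value come from, they can be recomputed by hand from the estimate, its standard error, and the residual degrees of freedom. The numbers below are copied from the summary(model) output; the variable names are introduced here for illustration.

```r
# Values copied from the summary(model) output
estimate  <- -0.03506
std_error <- 0.14269
df_resid  <- 902

# t statistic for H0: slope = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value_manual <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

round(t_value, 3)         # about -0.246
round(p_value_manual, 3)  # about 0.806
ci                        # interval contains 0
```

Because the confidence interval contains zero, it leads to the same conclusion as the p-value: the data are consistent with no linear trend in energy over release year.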

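P-values

# Extract the p-value for the released_year coefficient (row 2, column 4)
p_value <- summary(model)$coefficients[2, 4]

Model checking with diagnostic plots:

# Residuals vs Fitted: checks linearity
plot(model, which = 1)

# Normal Q-Q: checks normality of the residuals
plot(model, which = 2)

# Scale-Location: checks constant variance
plot(model, which = 3)

# Cook's distance: flags influential observations
plot(model, which = 4)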

Cook’s distance is a statistical measure used to identify influential data points in a regression analysis. It measures how much the fitted model changes when a particular observation is deleted. A large Cook’s distance indicates that removing that observation substantially changes the regression coefficients. In the Cook’s distance plot, the most influential observations are 359, 631, and 912, with 912 being the most influential.
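
Cook’s distance can also be extracted numerically with cooks.distance(). The sketch below uses a small made-up data set (x and y are hypothetical, not the Spotify data) in which the last observation is deliberately placed far from the trend. The 4/n cutoff is one common rule of thumb, used here as an assumption and not as a formal test.

```r
# Hypothetical toy data: 20 points on a clear trend plus one unusual point
x <- c(1:20, 40)
y <- c(2 * (1:20) + rep(c(-1, 1), 10), 0)

fit   <- lm(y ~ x)
cooks <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, other cutoffs are also used)
threshold <- 4 / length(cooks)

which(cooks > threshold)  # observations flagged as influential
which.max(cooks)          # observation 21, the injected point
```

Applied to the fitted model in this project, the same calls would be cooks.distance(model) followed by which.max() or sort() to list the most influential songs.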

R-squared (coefficient of determination) is the proportion of the variance in the response variable that is explained by the predictor variable(s) in a regression model. It ranges from 0 to 1, and a higher value indicates a better fit. R-squared does not indicate causation between variables or whether the model is appropriate for making predictions, so it should be used together with other model evaluation techniques.

R-squared in this case is 6.691e-05, meaning that released_year explains less than 0.01% of the variance in energy.
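
The definition can be checked by computing R-squared directly from the residual and total sums of squares. The sketch below uses simulated data (x and y are hypothetical) in which the response is unrelated to the predictor by construction, so the resulting R-squared should be small, similar in spirit to the result above.

```r
# Simulated (hypothetical) data: y does not depend on x
set.seed(42)
x <- runif(100, 2000, 2023)
y <- 65 + rnorm(100, sd = 15)

fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
r2_manual <- 1 - rss / tss

all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```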

Finally, let’s make predictions based on our model for new data points.

# New data: release years to predict for
new_data <- data.frame(released_year = c(2005, 2010, 2015))

# Predict energy for the new data
predictions <- predict(model, newdata = new_data)

# Display predictions
predictions
##        1        2        3 
## 65.04214 64.86686 64.69158
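
Point predictions alone hide how uncertain they are. As a sketch, predict() can also return prediction intervals through its interval argument. The example uses simulated data (toy is a hypothetical data frame, not the Spotify data), so the numbers it produces say nothing about the actual songs.

```r
# Simulated (hypothetical) data
set.seed(7)
toy <- data.frame(released_year = rep(2000:2023, each = 4))
toy$energy <- 65 + rnorm(nrow(toy), sd = 15)

fit <- lm(energy ~ released_year, data = toy)
new_years <- data.frame(released_year = c(2005, 2010, 2015))

# 95% prediction intervals for individual new songs
pred <- predict(fit, newdata = new_years,
                interval = "prediction", level = 0.95)
pred  # columns: fit, lwr, upr
```

With a residual standard error as large as the one in this project (16.27), the prediction intervals for the real model would be expected to be wide relative to the tiny differences between the predicted values.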

In conclusion, we have gone through the entire process of simple linear regression analysis using real data: data preparation, visualization, model fitting, interpretation of coefficients and p-values, model checking, R-squared evaluation, and making predictions. The diagnostic plots are used to check the adequacy of the model by examining linearity, normality, and constant variance, and by identifying influential points. For songs released in 2000 or later, we found no evidence of a linear relationship between release year and energy.