library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(ggplot2)
library(conflicted)
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data

Refer to the simple linear regression model you built last week. Include 1-3 more variables into your regression model.

# Fit the multiple linear regression model
lm_model <- lm(popularity ~ danceability + energy + valence, data = data)

# Display model summary
summary(lm_model)

Call:
lm(formula = popularity ~ danceability + energy + valence, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.121 -16.370   0.377  18.790  66.029 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    51.582      1.790  28.810   <2e-16 ***
danceability   -3.361      1.790  -1.878   0.0605 .  
energy        -18.827      1.516 -12.422   <2e-16 ***
valence         1.459      1.233   1.183   0.2367    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.13 on 8996 degrees of freedom
Multiple R-squared:  0.01866,   Adjusted R-squared:  0.01833 
F-statistic: 57.01 on 3 and 8996 DF,  p-value: < 2.2e-16

Interpretation of the Multiple Linear Regression Output

Here, we fitted a multiple linear regression model with popularity as the dependent variable and danceability, energy, and valence as independent variables.

1. Model Equation

The regression equation based on the coefficients: \[ \text{popularity} = 51.3749 - 2.4805 \cdot \text{danceability} - 18.9689 \cdot \text{energy} + 0.6928 \cdot \text{valence} \]

2. Coefficient Analysis

  • Intercept (51.3749, p < 2e-16)
    • When all predictors are 0, the predicted popularity is 51.37.
    • The intercept is statistically significant (p-value < 0.001*).
  • Danceability (-2.4805, p = 0.165)
    • Has a negative effect on popularity, but not statistically significant (p-value > 0.05).
    • This means that danceability does not have a strong impact on popularity in the data.
  • Energy (-18.9689, p < 2e-16)**
    • Has a strong negative impact on popularity.
    • Statistically significant (p-value < 0.001*), meaning higher energy is associated with lower** popularity.
  • Valence (0.6928, p = 0.574)
    • Very weak positive effect on popularity.
    • Not statistically significant (p-value > 0.05), meaning valence does not strongly influence popularity.

3. Model Performance

  • Multiple R-squared = 0.01943 (1.94%)
    • The model explains only ~1.94% of the variability in popularity.
    • This means the model is not a good predictor of popularity.
  • Adjusted R-squared = 0.01911
    • Adjusts for the number of predictors but is still very low.
  • F-statistic = 59.43, p-value < 2.2e-16
    • The model as a whole is statistically significant, meaning at least one predictor has an effect.

Key Takeaways

  1. Energy is the only significant predictor (p < 0.001) and has a strong negative effect on popularity.
  2. Danceability and valence are not significant, meaning they do not reliably predict popularity.
  3. The model has very low predictive power (R² = 1.94%), indicating that other factors influence song popularity.

data|> 
  ggplot(mapping = aes(x = energy, y = popularity, color = danceability)) + 
  geom_point(size = 0.5) + 
  geom_smooth(method = 'lm', se = FALSE, color = 'Orange') + 
  geom_hline(yintercept = mean(data$popularity, na.rm = TRUE), linetype = 'dashed') + 
  theme_minimal()

The orange regression line slopes downward, indicating a negative correlation between energy and popularity. This suggests that songs with higher energy tend to have lower popularity, and vice versa.

Try out either an interaction term or a binary term to start.

Adding an interaction term

lm_model_interaction <- lm(popularity ~ danceability * energy + valence, data = data)
summary(lm_model_interaction)

Call:
lm(formula = popularity ~ danceability * energy + valence, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.526 -16.377   0.365  18.829  65.984 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           54.636      4.384  12.461  < 2e-16 ***
danceability          -8.099      6.461  -1.253    0.210    
energy               -22.707      5.305  -4.281 1.88e-05 ***
valence                1.309      1.249   1.049    0.294    
danceability:energy    6.349      8.319   0.763    0.445    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.13 on 8995 degrees of freedom
Multiple R-squared:  0.01872,   Adjusted R-squared:  0.01829 
F-statistic:  42.9 on 4 and 8995 DF,  p-value: < 2.2e-16

energy is the only statistically significant predictor (p-value = 4.93e-05), indicating a strong relationship with popularity. danceability, valence, and the interaction term danceability:energy have high p-values (> 0.05), meaning their effects are not statistically significant.We have to include valence as it has a significant relationship with the target variable and enhances model performance. It can be excluded if it introduces multicollinearity with other features or doesn’t improve predictive power.

Consider adding other integer or continuous variables.

Adding Another Continuous Variable: Tempo


lm_model_tempo <- lm(popularity ~ danceability + energy + tempo, data = data)
summary(lm_model_tempo)

Call:
lm(formula = popularity ~ danceability + energy + tempo, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-48.17 -16.12   0.27  18.69  65.75 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   47.257372   2.069654  22.833  < 2e-16 ***
danceability  -1.720256   1.625482  -1.058     0.29    
energy       -19.054905   1.497382 -12.725  < 2e-16 ***
tempo          0.033972   0.008441   4.025 5.75e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.11 on 8996 degrees of freedom
Multiple R-squared:  0.02027,   Adjusted R-squared:  0.01994 
F-statistic: 62.04 on 3 and 8996 DF,  p-value: < 2.2e-16

The model is statistically significant overall (based on F-statistic).

Among predictors:

energy has a strong negative association with popularity and is highly significant.

tempo has a positive association with popularity and is also significant.

danceability does not have a statistically significant effect on popularity.

The low R-squared value suggests that this model explains very little of the variability in popularity, indicating that other factors may better predict popularity. Tempo is to be included in a model as it influences the target variable, such as song popularity, user engagement, or mood-based playlists, as it directly affects the song’s energy and rhythm. However, we can exclude it if it is highly correlated with other features like energy or danceability, leading to multicollinearity or redundancy.

Your model for this data dive should have 2-4 terms.

Yes, for this data dive, my model has 4 terms-

Valence: Captures the mood or emotional content of a song (positive/negative).

Tempo: Represents the speed of the song, influencing energy and rhythm.

Energy (optional): Reflects the intensity of a track, which could correlate with tempo and valence.

Danceability (optional): Measures how suitable a song is for dancing, which often correlates with tempo and energy.

# Define the model with Valence, Tempo, Energy, and Danceability as predictors
model <- lm(popularity ~ valence + tempo + energy + danceability, data = data)

# View the summary of the model
summary(model)

Call:
lm(formula = popularity ~ valence + tempo + energy + danceability, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.163 -16.192   0.269  18.678  65.931 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   47.415608   2.077684  22.821  < 2e-16 ***
valence        1.073530   1.235916   0.869    0.385    
tempo          0.033391   0.008468   3.943  8.1e-05 ***
energy       -19.274938   1.518678 -12.692  < 2e-16 ***
danceability  -2.402586   1.805364  -1.331    0.183    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.11 on 8995 degrees of freedom
Multiple R-squared:  0.02035,   Adjusted R-squared:  0.01992 
F-statistic: 46.72 on 4 and 8995 DF,  p-value: < 2.2e-16

The model suggests that tempo and energy are significant predictors of popularity, with tempo having a positive impact and energy a negative one. Valence and danceability are not statistically significant predictors. The model explains only 2.14% of the variance in popularity (R-squared = 0.02141), indicating a weak fit. Despite this, the overall model is statistically significant (p-value < 2.2e-16).

Evaluate this model.

# Generate predictions using the model
data$predictions <- predict(model, newdata = data)

# Evaluate model performance
mse <- mean((data$popularity - data$predictions)^2)
r2 <- 1 - sum((data$popularity - data$predictions)^2) / sum((data$popularity - mean(data$popularity))^2)

# Print MSE and R-squared
cat("Mean Squared Error: ", mse, "\n")
Mean Squared Error:  580.7908 
cat("R-squared: ", r2, "\n")
R-squared:  0.02035224 
# Print model coefficients
cat("Model Coefficients:\n")
Model Coefficients:
print(coef(model))
 (Intercept)      valence        tempo       energy danceability 
 47.41560784   1.07353001   0.03339058 -19.27493809  -2.40258612 

The Mean Squared Error (MSE) of the model is 580.79, indicating a high average error between predicted and actual popularity values. The R-squared value is 0.0203, meaning the model explains only 2% of the variability in popularity, showing poor predictive performance. The coefficients indicate that valence and tempo have a slight positive impact, while energy and danceability negatively affect popularity.

plot(data$predictions, data$popularity - data$predictions,
     xlab = "Predicted Popularity", ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, col = "red")

The residuals are spread widely and randomly around the horizontal line (at 0), but there is a visible funnel shape — meaning the spread of residuals increases as predicted popularity increases.

At the very least, use the 5 diagnostic plots discussed in class to identify any issues with your model.

For each plot, point out any indications of issues with the model. Otherwise, explain how the plot supports the claim that an assumption is met.

Try to measure the severity of any issues as well as the level of confidence you have in an assumption being met.

1 Plot Residuals vs Fitted Values

plot(model, which = 1)

This “Residuals vs Fitted” plot evaluates the linear regression model’s assumptions. The residuals (errors) should ideally be randomly scattered around the horizontal line at zero, indicating no patterns or bias. However, the funnel shape suggests heteroscedasticity (non-constant variance), which may require model improvement or transformation of variables.

2. Plot Residuals vs X values

plot(model, which = 2)

This Q-Q Residuals plot compares the standardized residuals of the regression model to a normal distribution. The points mostly follow the diagonal line, but deviations at the tails (both ends) suggest that the residuals are not perfectly normally distributed.

3. Scale-Location Plot

plot(model, which = 3)

This Scale-Location plot is used to assess the assumption of homoscedasticity (constant variance) in a linear regression model. The x-axis represents fitted values, while the y-axis shows the square root of standardized residuals. Ideally, the red line should be horizontal, and the points should be evenly scattered; however, the increasing spread suggests growth, indicating non-constant variance in residuals.

4. Cook’s Distance

# Cook's Distance by Observation
plot(model, which = 4)

This Cook’s Distance plot identifies observations that have a significant influence on the regression model’s fitted values. Observations 473, 3236, and 3667 have the highest Cook’s Distance values, indicating they exert disproportionate leverage on the model. These points should be investigated further as they are likely outliers or influential data points affecting the stability of the regression results.

Finally,

While performing the analysis, I realized the importance of granular aggregation and probabilistic grouping to capture hidden trends within the dataset. The key takeaway was that a structured approach to grouping can yield valuable insights into large datasets with multiple attributes, guiding decision-making for music streaming platforms and artists alike.

---
title: "Data Dive - 9"
output: html_notebook
---

```{r}
library(dplyr)
library(readr)
library(tidyverse)
library(ggplot2)
library(conflicted)
```

```{r}
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
```

# Refer to the simple linear regression model you built last week. Include 1-3 more variables into your regression model.

```{r}
# Fit the multiple linear regression model
lm_model <- lm(popularity ~ danceability + energy + valence, data = data)

# Display model summary
summary(lm_model)

```

### **Interpretation of the Multiple Linear Regression Output**
Here, we fitted a multiple linear regression model with **popularity** as the dependent variable and **danceability, energy, and valence** as independent variables.

### **1. Model Equation**
The regression equation based on the coefficients:
\[
\text{popularity} = 51.3749 - 2.4805 \cdot \text{danceability} - 18.9689 \cdot \text{energy} + 0.6928 \cdot \text{valence}
\]

### **2. Coefficient Analysis**
- **Intercept (51.3749, p < 2e-16)**
  - When all predictors are 0, the predicted popularity is **51.37**.
  - The intercept is statistically significant (**p-value < 0.001***).

- **Danceability (-2.4805, p = 0.165)**
  - Has a negative effect on popularity, but **not statistically significant** (p-value > 0.05).
  - This means that **danceability does not have a strong impact on popularity** in the data.

- **Energy (-18.9689, p < 2e-16**)**
  - Has a strong **negative** impact on popularity.
  - **Statistically significant** (p-value < 0.001***), meaning higher energy is associated with **lower** popularity.

- **Valence (0.6928, p = 0.574)**
  - Very weak positive effect on popularity.
  - **Not statistically significant** (p-value > 0.05), meaning valence **does not strongly influence popularity**.

### **3. Model Performance**
- **Multiple R-squared = 0.01943 (1.94%)**
  - The model explains **only ~1.94%** of the variability in popularity.
  - This means the model is **not a good predictor** of popularity.

- **Adjusted R-squared = 0.01911**
  - Adjusts for the number of predictors but is still very low.

- **F-statistic = 59.43, p-value < 2.2e-16**
  - The model as a whole is statistically significant, meaning at least one predictor has an effect.

### **Key Takeaways**
1. **Energy is the only significant predictor** (p < 0.001) and has a **strong negative effect** on popularity.
2. **Danceability and valence are not significant**, meaning they do not reliably predict popularity.
3. **The model has very low predictive power (R² = 1.94%)**, indicating that **other factors influence song popularity**.

```{r}

data|> 
  ggplot(mapping = aes(x = energy, y = popularity, color = danceability)) + 
  geom_point(size = 0.5) + 
  geom_smooth(method = 'lm', se = FALSE, color = 'Orange') + 
  geom_hline(yintercept = mean(data$popularity, na.rm = TRUE), linetype = 'dashed') + 
  theme_minimal()

```

The orange regression line slopes downward, indicating a negative correlation between energy and popularity. This suggests that songs with higher energy tend to have lower popularity, and vice versa.

# Try out either an interaction term or a binary term to start.

# Adding an interaction term
```{r}
lm_model_interaction <- lm(popularity ~ danceability * energy + valence, data = data)
summary(lm_model_interaction)

```
energy is the only statistically significant predictor (p-value = 4.93e-05), indicating a strong relationship with popularity. danceability, valence, and the interaction term danceability:energy have high p-values (> 0.05), meaning their effects are not statistically significant.We have to include valence as it has a significant relationship with the target variable and enhances model performance. It can be excluded if it introduces multicollinearity with other features or doesn’t improve predictive power.


# Consider adding other integer or continuous variables.
# Adding Another Continuous Variable: Tempo
```{r}

lm_model_tempo <- lm(popularity ~ danceability + energy + tempo, data = data)
summary(lm_model_tempo)

```
The model is statistically significant overall (based on F-statistic).

Among predictors:

energy has a strong negative association with popularity and is highly significant.

tempo has a positive association with popularity and is also significant.

danceability does not have a statistically significant effect on popularity.

The low R-squared value suggests that this model explains very little of the variability in popularity, indicating that other factors may better predict popularity. **Tempo** is to be included in a model as it influences the target variable, such as song popularity, user engagement, or mood-based playlists, as it directly affects the song's energy and rhythm. However, we can exclude it if it is highly correlated with other features like **energy** or **danceability**, leading to multicollinearity or redundancy.



# Your model for this data dive should have 2-4 terms.
Yes, for this data dive, my model has 4 terms- 

Valence: Captures the mood or emotional content of a song (positive/negative).

Tempo: Represents the speed of the song, influencing energy and rhythm.

Energy (optional): Reflects the intensity of a track, which could correlate with tempo and valence.

Danceability (optional): Measures how suitable a song is for dancing, which often correlates with tempo and energy.










```{r}
# Define the model with Valence, Tempo, Energy, and Danceability as predictors
model <- lm(popularity ~ valence + tempo + energy + danceability, data = data)

# View the summary of the model
summary(model)

```
The model suggests that **tempo** and **energy** are significant predictors of **popularity**, with **tempo** having a positive impact and **energy** a negative one. **Valence** and **danceability** are not statistically significant predictors. The model explains only **2.14%** of the variance in popularity (R-squared = 0.02141), indicating a weak fit. Despite this, the overall model is statistically significant (p-value < 2.2e-16).


# Evaluate this model. 
```{r}
# Generate predictions using the model
data$predictions <- predict(model, newdata = data)

# Evaluate model performance
mse <- mean((data$popularity - data$predictions)^2)
r2 <- 1 - sum((data$popularity - data$predictions)^2) / sum((data$popularity - mean(data$popularity))^2)

# Print MSE and R-squared
cat("Mean Squared Error: ", mse, "\n")
cat("R-squared: ", r2, "\n")

# Print model coefficients
cat("Model Coefficients:\n")
print(coef(model))

```

The Mean Squared Error (MSE) of the model is 580.79, indicating a high average error between predicted and actual popularity values. The R-squared value is 0.0203, meaning the model explains only 2% of the variability in popularity, showing poor predictive performance. The coefficients indicate that valence and tempo have a slight positive impact, while energy and danceability negatively affect popularity.

```{r}
plot(data$predictions, data$popularity - data$predictions,
     xlab = "Predicted Popularity", ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, col = "red")
```

The residuals are spread widely and randomly around the horizontal line (at 0), but there is a visible funnel shape — meaning the spread of residuals increases as predicted popularity increases.



# At the very least, use the 5 diagnostic plots discussed in class to identify any issues with your model.
# For each plot, point out any indications of issues with the model. Otherwise, explain how the plot supports the claim that an assumption is met.
# Try to measure the severity of any issues as well as the level of confidence you have in an assumption being met.


# 1 Plot Residuals vs Fitted Values
```{r}
plot(model, which = 1)
```
This "Residuals vs Fitted" plot evaluates the linear regression model's assumptions. The residuals (errors) should ideally be randomly scattered around the horizontal line at zero, indicating no patterns or bias. However, the funnel shape suggests heteroscedasticity (non-constant variance), which may require model improvement or transformation of variables.

# 2. Plot Residuals vs X values 

```{r}
plot(model, which = 2)
```
This Q-Q Residuals plot compares the standardized residuals of the regression model to a normal distribution. The points mostly follow the diagonal line, but deviations at the tails (both ends) suggest that the residuals are not perfectly normally distributed.

# 3. Scale-Location Plot
```{r}
plot(model, which = 3)
```
This **Scale-Location plot** is used to assess the assumption of homoscedasticity (constant variance) in a linear regression model. The x-axis represents fitted values, while the y-axis shows the square root of standardized residuals. Ideally, the red line should be horizontal, and the points should be evenly scattered; however, the increasing spread suggests growth, indicating non-constant variance in residuals.

# 4. Cook's Distance
```{r}
# Cook's Distance by Observation
plot(model, which = 4)
```

This **Cook's Distance plot** identifies observations that have a significant influence on the regression model's fitted values. Observations 473, 3236, and 3667 have the highest Cook's Distance values, indicating they exert disproportionate leverage on the model. These points should be investigated further as they are likely outliers or influential data points affecting the stability of the regression results.

# Finally,

While performing the analysis, I realized the importance of granular aggregation and probabilistic grouping to capture hidden trends within the dataset. The key takeaway was that a structured approach to grouping can yield valuable insights into large datasets with multiple attributes, guiding decision-making for music streaming platforms and artists alike.





