Week 8

R Markdown

data <- read.csv("/Users/yashuvaishu/Downloads/Spotify.csv")

Here I treat danceability as a continuous variable and genre as a categorical variable.

Response variable = danceability

Explanatory variable = genre

Hypothesis

Null Hypothesis (H0): Mean danceability is the same across all music genres.

Alternative Hypothesis (H1): Mean danceability differs for at least one music genre.

I am using a significance level (alpha) of 0.05.

alpha <- 0.05

# ANOVA test
anova_result <- aov(danceability ~ genre,data)

# Summary of ANOVA test
summary(anova_result)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## genre        522  86.74 0.16618   10.43 <2e-16 ***
## Residuals   7988 127.28 0.01593                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value (< 2e-16) is far below the significance level of 0.05, we reject the null hypothesis (H0) in favor of the alternative hypothesis (H1).

There is enough evidence to conclude that there is a significant difference in danceability scores among different music genres.

Listeners can therefore expect danceability to vary with the genre they choose, although the test does not identify which genres differ.
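A significant F test does not say how large the genre effect is. As a rough sketch, eta-squared can be computed directly from the sums of squares printed in the ANOVA table above. The variable names here are my own, and the numbers are copied from the rounded output, so the result is only approximate.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_genre <- 86.74
ss_resid <- 127.28
df_genre <- 522
df_resid <- 7988

# Eta-squared: share of the total variation in danceability attributed to genre
eta_sq <- ss_genre / (ss_genre + ss_resid)

# Recompute the F statistic as a consistency check on the copied values
f_stat <- (ss_genre / df_genre) / (ss_resid / df_resid)

round(eta_sq, 3)  # 0.405
round(f_stat, 2)  # 10.43
```

Df = 522 implies 523 distinct genre values, so many genres probably contain only a few songs and this figure should be read with some caution.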

model <- lm(danceability ~ energy, data=data)

# Summary of the model
summary(model)

## 
## Call:
## lm(formula = danceability ~ energy, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55483 -0.09529  0.01383  0.10850  0.41053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.500300   0.004216  118.67   <2e-16 ***
## energy      0.178790   0.006842   26.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1526 on 8509 degrees of freedom
## Multiple R-squared:  0.07429,    Adjusted R-squared:  0.07418 
## F-statistic: 682.9 on 1 and 8509 DF,  p-value: < 2.2e-16

The interpretation below is based on the Coefficients section of the model summary.

The linear regression output shows that energy is a significant predictor of danceability (p < 2e-16), with higher energy associated with higher danceability. The coefficient estimate for energy is 0.178790, meaning that each one-unit increase in energy is associated with an expected increase of 0.178790 units in danceability. The intercept (0.500300) is the expected danceability when energy is 0. However, the adjusted R-squared of 0.07418 means the model explains only about 7.4% of the variance, so other factors not captured by this model likely influence danceability.
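To make the size of the energy effect concrete, the fitted line can be evaluated by hand. This is a minimal sketch using the coefficients copied from the summary above. The helper `predict_danceability` and the two energy values are hypothetical, chosen on the assumption that energy is scored from 0 to 1.

```r
# Coefficients copied from the model summary above
b0 <- 0.500300  # intercept
b1 <- 0.178790  # slope for energy

# Hypothetical helper: predicted danceability for a given energy value
predict_danceability <- function(energy) b0 + b1 * energy

# Illustrative energy values (assumed, not taken from the data)
low  <- predict_danceability(0.2)
high <- predict_danceability(0.8)

round(c(low = low, high = high, difference = high - low), 3)
# low 0.536, high 0.643, difference 0.107
```

Under that assumption, moving from a low-energy to a high-energy song shifts predicted danceability by only about 0.11, which fits the low R-squared.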

library(ggplot2)

ggplot(data, aes(x = energy, y = danceability)) + 
  geom_point(color = "lightblue") + 
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "red", linewidth = 1.5) +
  labs(title = "Regression line of danceability by energy",
       x = "Energy",
       y = "Danceability")

Diagnostic plots for the linear regression model above:

# plot of residuals vs. fitted values
plot(model, which = 1,col = "steelblue")

# normal Q-Q plot of standardized residuals
plot(model, which = 2, col = "orange")
# plot.lm draws standardized residuals, so the reference line must use them too
qqline(rstandard(model), col = "blue")

# scale-location plot of residuals
plot(model, which = 3, col = "red")

# plot of Cook's distance
plot(model, which = 4, col = "purple")
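The four diagnostic plots can also be arranged in one 2 x 2 panel, which makes them easier to compare. The sketch below runs on simulated stand-in data (`sim` and `fit` are hypothetical names, and the simulated coefficients are only loosely modeled on the summary above). In the real analysis, `fit` would be replaced by `model`.

```r
# Simulated stand-in data; the real analysis would use `data` and `model`
set.seed(1)
sim <- data.frame(energy = runif(200))
sim$danceability <- 0.5 + 0.18 * sim$energy + rnorm(200, sd = 0.15)
fit <- lm(danceability ~ energy, data = sim)

# Discard graphics output so the sketch runs without a display;
# drop this line and dev.off() to see the plots interactively
pdf(NULL)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid, remembering the old settings
plot(fit, which = 1:4)           # residuals vs fitted, Q-Q, scale-location, Cook's distance
par(old_par)                     # restore the previous layout

invisible(dev.off())
```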

Including another variable in the regression model

# multiple linear regression model with 'Energy' and 'Valence' 
multi_model <- lm(danceability ~ energy + valence, data=data)

# Summarize the multiple regression model
summary(multi_model)

## 
## Call:
## lm(formula = danceability ~ energy + valence, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45980 -0.08416  0.01060  0.09658  0.39618 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.445977   0.004015 111.082  < 2e-16 ***
## energy      0.045327   0.006912   6.558 5.79e-11 ***
## valence     0.298592   0.006878  43.413  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1381 on 8508 degrees of freedom
## Multiple R-squared:  0.2422, Adjusted R-squared:  0.242 
## F-statistic:  1359 on 2 and 8508 DF,  p-value: < 2.2e-16

Estimate:

The model above regresses danceability on energy and valence. The output indicates that both are significant predictors of danceability, with estimated regression coefficients of 0.045 and 0.299, respectively. Holding valence constant, each one-unit increase in energy is associated with an expected increase of 0.045 units in danceability; holding energy constant, each one-unit increase in valence is associated with an expected increase of 0.299 units. Note that the energy coefficient drops from 0.179 in the simple model to 0.045 once valence is included.

The adjusted R-squared of the model is 0.242, indicating that the predictors explain about 24.2% of the variance in danceability, up from about 7.4% with energy alone.
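Whether valence adds real explanatory power can be checked with a partial F statistic, computed here from the printed R-squared values rather than from the data. This is a sketch under the assumption that both models were fit to the same 8511 rows. The variable names are mine, and the inputs are rounded, so the two results agree only approximately.

```r
# Values copied from the two model summaries above
r2_simple <- 0.07429  # danceability ~ energy
r2_multi  <- 0.2422   # danceability ~ energy + valence
df_resid  <- 8508     # residual degrees of freedom of the larger model
t_valence <- 43.413   # t value reported for valence

# Partial F statistic for adding one predictor (valence)
f_partial <- (r2_multi - r2_simple) / ((1 - r2_multi) / df_resid)

# With a single added predictor, the partial F equals the squared t value,
# up to rounding in the printed output
round(c(f_partial = f_partial, t_squared = t_valence^2))
# f_partial 1885, t_squared 1885
```

With the data loaded, `anova(model, multi_model)` should report the same comparison directly.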

To add another variable to the model, I consider the relationship between danceability and energy while controlling for tempo as a covariate. Tempo is measured in beats per minute, so it is plausible that tempo also helps predict danceability.

model_two <- lm(danceability ~ energy + tempo, data)

# summary
summary(model_two)

## 
## Call:
## lm(formula = danceability ~ energy + tempo, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58469 -0.09469  0.01378  0.10781  0.41258 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.282e-01  7.430e-03  71.089  < 2e-16 ***
## energy       1.854e-01  6.986e-03  26.537  < 2e-16 ***
## tempo       -2.652e-04  5.828e-05  -4.551 5.42e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1524 on 8508 degrees of freedom
## Multiple R-squared:  0.07654,    Adjusted R-squared:  0.07632 
## F-statistic: 352.6 on 2 and 8508 DF,  p-value: < 2.2e-16

The results show that both energy (β = 0.1854, p < 0.001) and tempo (β = -0.00027, p < 0.001) are significant predictors of danceability, indicating that higher energy and lower tempo are each associated with higher danceability after controlling for the other. The adjusted R-squared of 0.07632 indicates that the model explains about 7.6% of the variance in danceability, barely more than the 7.4% explained by energy alone.
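Statistical significance aside, the tempo coefficient is very small in practical terms. As a quick illustration, the snippet below scales it to a hypothetical 40 BPM difference between two songs. The 40 BPM gap is an assumed value for illustration, not something taken from the data.

```r
# Tempo coefficient copied from the model summary above (per beat per minute)
b_tempo <- -2.652e-04

# Illustrative tempo difference in BPM (an assumed value, not from the data)
tempo_gap <- 40

# Expected change in danceability for that difference, holding energy fixed
effect <- b_tempo * tempo_gap
effect  # about -0.0106
```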

Now suppose the effect of energy on danceability depends on tempo. I test this hypothesis by adding an interaction term, energy * tempo.

model_three <- lm(danceability ~ energy * tempo, data)

# summary
summary(model_three)

## 
## Call:
## lm(formula = danceability ~ energy * tempo, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52094 -0.09408  0.01245  0.10816  0.41084 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3464675  0.0153791   22.53   <2e-16 ***
## energy        0.5408261  0.0273157   19.80   <2e-16 ***
## tempo         0.0013324  0.0001320   10.09   <2e-16 ***
## energy:tempo -0.0030463  0.0002265  -13.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1508 on 8507 degrees of freedom
## Multiple R-squared:  0.09577,    Adjusted R-squared:  0.09545 
## F-statistic: 300.3 on 3 and 8507 DF,  p-value: < 2.2e-16

Based on the output of the model, we can conclude that the interaction term between energy and tempo is significant (p-value < 0.001). This suggests that the effect of energy on danceability depends on the level of tempo. Specifically, the negative coefficient estimate for the interaction term (-0.003) indicates that the effect of energy on danceability becomes weaker at higher levels of tempo. Because of the interaction, the energy and tempo coefficients are now conditional effects: 0.541 is the slope of energy when tempo is 0, not an overall effect.
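In an interaction model, the slope of energy at a given tempo is the energy coefficient plus the interaction coefficient times tempo. The sketch below evaluates that slope at a few tempos using the coefficients copied from the summary above. The helper `energy_slope` and the tempo values are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the interaction model summary above
b_energy      <- 0.5408261
b_interaction <- -0.0030463

# Hypothetical helper: slope of energy at a given tempo (in BPM)
energy_slope <- function(tempo) b_energy + b_interaction * tempo

# Illustrative tempos (assumed values, not taken from the data)
slopes <- round(energy_slope(c(80, 120, 160)), 3)
slopes  # 0.297 0.175 0.053

# Tempo at which the estimated slope of energy reaches zero
zero_tempo <- round(-b_energy / b_interaction, 1)
zero_tempo  # 177.5
```

So the estimated effect of energy shrinks steadily as tempo rises and is close to zero at roughly 178 BPM, according to these point estimates.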

However, it’s important to note that the adjusted R-squared of the model is relatively low (0.095), indicating that the predictors in the model do not explain a large proportion of the variation in danceability.