R Markdown

Binary Column:

Here I treat mode as a binary column of 1s and 0s. A song with mode 1 is in a major key, and a song with mode 0 is in a minor (non-major) key.

data1 = read.csv("/Users/yashuvaishu/Downloads/Spotify1.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
model1 <- glm(mode ~ valence + danceability + energy + loudness, data = data1, family = binomial)
summary(model1)
## 
## Call:
## glm(formula = mode ~ valence + danceability + energy + loudness, 
##     family = binomial, data = data1)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.411559   0.177009  13.624  < 2e-16 ***
## valence       0.682212   0.114492   5.959 2.54e-09 ***
## danceability -1.090562   0.167240  -6.521 6.99e-11 ***
## energy       -1.778491   0.161581 -11.007  < 2e-16 ***
## loudness      0.065002   0.007071   9.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11329  on 8511  degrees of freedom
## Residual deviance: 11174  on 8507  degrees of freedom
## AIC: 11184
## 
## Number of Fisher Scoring iterations: 4

From the above summary we can say that:

The intercept of 2.411559 is the estimated log-odds of mode = 1 (a major song) when all the predictor variables (valence, danceability, energy and loudness) are zero. This value is statistically significant at p < 0.001, meaning the log-odds at that baseline differ significantly from 0, i.e. the predicted probability of a major song there differs from 0.5.
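To read the intercept on the probability scale, the log-odds can be passed through the inverse logit. This is a minimal sketch using the estimate printed in the summary above; `intercept_logodds` and `baseline_prob` are illustrative names, not part of the original analysis.

```r
# Intercept copied from the summary output above (log-odds scale)
intercept_logodds <- 2.411559

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
baseline_prob <- plogis(intercept_logodds)
print(round(baseline_prob, 3))
```

This is the predicted probability of a major song when every predictor equals 0. That combination may sit at the edge of the observed data, so it is best read as a reference point rather than a typical song.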

For a one-unit increase in valence, the log-odds of a song being major (mode = 1) increase by 0.682212, holding all other variables constant. This estimate is also statistically significant at p < 0.001.

For a one-unit increase in danceability, the log-odds of a song being major decrease by 1.090562, holding all other variables constant. This estimate is also statistically significant at p < 0.001.

For a one-unit increase in energy, the log-odds of a song being major decrease by 1.778491, holding all other variables constant. This estimate is also statistically significant at p < 0.001.

For a one-unit increase in loudness, the log-odds of a song being major increase by 0.065002, holding all other variables constant. This estimate is also statistically significant at p < 0.001.
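Log-odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The sketch below copies the estimates from the summary output above; `log_odds` and `odds_ratios` are illustrative names.

```r
# Estimates copied from the summary output above (log-odds scale)
log_odds <- c(valence = 0.682212, danceability = -1.090562,
              energy = -1.778491, loudness = 0.065002)

# Exponentiating turns an additive change in log-odds into a
# multiplicative change in odds (an odds ratio)
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

An odds ratio above 1 means the odds of a major song rise as the predictor increases, and a value below 1 means they fall. With the fitted model available, `exp(coef(model1))` should give the same numbers. If valence, danceability and energy are on a 0 to 1 scale (an assumption about this dataset), a "one-unit" change spans their entire range.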

First, I find the confidence interval without using the standard error directly. For a glm, confint() computes a profile-likelihood interval.

confint(model1,parm = "valence")
## Waiting for profiling to be done...
##     2.5 %    97.5 % 
## 0.4581545 0.9070005

Here I got the 95% confidence interval for the valence coefficient as (0.4581545, 0.9070005).

Now I use the standard error to find the confidence interval (a Wald interval: estimate ± 1.96 × standard error). Let us see what happens.

# Standard error of the valence coefficient (square root of its variance)
val <- sqrt(diag(vcov(model1))["valence"])

# Calculating the confidence interval
lower_valence <- coef(model1)["valence"] - 1.96 * val
upper_valence <- coef(model1)["valence"] + 1.96 * val

cat("95% Confidence Interval for Valence Coefficient: (", lower_valence, ", ", upper_valence, ")\n")
## 95% Confidence Interval for Valence Coefficient: ( 0.4578082 ,  0.9066163 )

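So the two intervals differ only from the fourth decimal place onward (by about 0.0004 at each bound). The Wald interval is symmetric around the estimate by construction, while the profile-likelihood interval need not be; with a sample this large (8,512 songs) the two nearly coincide.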
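The manual Wald calculation can also be cross-checked against `confint.default()`, which builds the interval from `coef()` and `vcov()`. The sketch below uses simulated stand-in data (an assumption, not the Spotify file), and `fit`, `x` and `y` are hypothetical names.

```r
# Simulated stand-in data (an assumption; not the Spotify file)
set.seed(1)
n <- 500
x <- runif(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 0.7 * x))
fit <- glm(y ~ x, family = binomial)

# Manual Wald interval: estimate +/- z * standard error
se_x <- sqrt(diag(vcov(fit)))["x"]
z <- qnorm(0.975)  # about 1.96
manual_ci <- unname(coef(fit)["x"] + c(-1, 1) * z * se_x)

# confint.default() computes the same Wald interval from coef() and vcov()
wald_ci <- unname(confint.default(fit, parm = "x")[1, ])

print(rbind(manual = manual_ci, wald = wald_ci))
```

Using `qnorm(0.975)` instead of the rounded 1.96 makes the two rows agree to floating-point precision. On the real model, `confint.default(model1, parm = "valence")` would be the analogous call.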

# Fit the same relationship that is plotted below (danceability vs energy)
model1 <- lm(danceability ~ energy,
             data = filter(data1, mode == 1))

rsquared <- summary(model1)$r.squared

data1 |> 
  filter(mode == 1) |>
  ggplot(mapping = aes(x = energy, 
                       y = danceability)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Energy vs danceability",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

model1 <- lm(danceability ~ loudness,
            filter(data1, mode == 1))

rsquared <- summary(model1)$r.squared

data1 |> 
  filter( mode == 1  ) |>
  ggplot(mapping = aes(x = loudness, 
                       y = danceability)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Loudness vs danceability",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

In both graphs above, the smoothed curve stays close to the dashed linear fit, so the relationships look approximately linear and I see no need to transform these variables.
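A visual check like this could be backed up numerically by testing whether a curvature term improves the linear fit. This is only one possible check, sketched on simulated stand-in data (an assumption, not the Spotify file); `linear_fit`, `curved_fit` and `r2_gain` are hypothetical names.

```r
# Simulated stand-in data with a truly linear relationship (an assumption)
set.seed(42)
n <- 300
x <- runif(n, min = -20, max = 0)
y <- 0.6 + 0.01 * x + rnorm(n, sd = 0.1)

linear_fit <- lm(y ~ x)
curved_fit <- lm(y ~ x + I(x^2))  # adds a quadratic term

# How much R-squared the curvature term adds
r2_gain <- summary(curved_fit)$r.squared - summary(linear_fit)$r.squared
print(r2_gain)

# F-test comparing the nested models
print(anova(linear_fit, curved_fit))
```

A negligible gain in R-squared and a non-significant F-test would be consistent with leaving the variable untransformed. On the real data, the same comparison could be run with danceability as the response and energy or loudness as the predictor.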