Binary Column:
Here I treat mode as a binary column of 1’s and 0’s. A song in a major key has mode 1, and a song in a minor (non-major) key has mode 0.
data1 = read.csv("/Users/yashuvaishu/Downloads/Spotify1.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
model1 <- glm(mode ~ valence + danceability + energy + loudness, data = data1, family = binomial)
summary(model1)
##
## Call:
## glm(formula = mode ~ valence + danceability + energy + loudness,
## family = binomial, data = data1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.411559 0.177009 13.624 < 2e-16 ***
## valence 0.682212 0.114492 5.959 2.54e-09 ***
## danceability -1.090562 0.167240 -6.521 6.99e-11 ***
## energy -1.778491 0.161581 -11.007 < 2e-16 ***
## loudness 0.065002 0.007071 9.192 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11329 on 8511 degrees of freedom
## Residual deviance: 11174 on 8507 degrees of freedom
## AIC: 11184
##
## Number of Fisher Scoring iterations: 4
From the above summary we can say that:
The intercept of 2.411559 is the estimated log-odds of mode = 1 when all the predictor variables (valence, danceability, energy and loudness) are zero. This value is statistically significant at p < 0.001, which indicates that the log-odds at that point differ from zero, i.e. the odds of a major song are not 1:1 when all predictors are zero. It does not by itself test a difference between the two groups (mode=0, mode=1).
For a one-unit increase in valence, the log-odds of mode = 1 increase by 0.682212, holding all other variables constant. This estimate is statistically significant at p < 0.001.
For a one-unit increase in danceability, the log-odds of mode = 1 decrease by 1.090562, holding all other variables constant. This estimate is also statistically significant at p < 0.001.
For a one-unit increase in energy, the log-odds of mode = 1 decrease by 1.778491, holding all other variables constant. This estimate is also statistically significant at p < 0.001.
For a one-unit increase in loudness, the log-odds of mode = 1 increase by 0.065002, holding all other variables constant. This estimate is also statistically significant at p < 0.001.
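Log-odds are hard to read directly, so a minimal sketch of converting them may help. The block below copies the estimates from the summary above into a vector (the names log_odds, odds_ratios and baseline_prob are my own, chosen for illustration) and converts the slopes to odds ratios and the intercept to a probability.

```r
# Coefficients copied from the summary(model1) output above (log-odds scale)
log_odds <- c(
  intercept    =  2.411559,
  valence      =  0.682212,
  danceability = -1.090562,
  energy       = -1.778491,
  loudness     =  0.065002
)

# Exponentiating a slope gives the multiplicative change in the odds of
# mode = 1 for a one-unit increase in that predictor
odds_ratios <- exp(log_odds[-1])
print(round(odds_ratios, 3))

# The inverse logit turns the intercept into the probability of mode = 1
# when every predictor equals zero
baseline_prob <- plogis(log_odds[["intercept"]])
print(round(baseline_prob, 3))
```

An odds ratio above 1 (valence, loudness) means the odds of a major song go up with that predictor, and a ratio below 1 (danceability, energy) means they go down. With the fitted object available, exp(coef(model1)) should give the same odds ratios without retyping the numbers.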
First, I find the confidence interval with confint(), which profiles the likelihood instead of using the standard error directly.
confint(model1,parm = "valence")
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.4581545 0.9070005
Here I got the 95% confidence interval for the valence coefficient as (0.4581545, 0.9070005).
Now I use the standard error to compute the confidence interval by hand (estimate ± 1.96 × standard error) and compare the result.
val <- sqrt(diag(vcov(model1))["valence"])
# Calculating the confidence interval
lower_valence <- coef(model1)["valence"] - 1.96 * val
upper_valence <- coef(model1)["valence"] + 1.96 * val
cat("95% Confidence Interval for Valence Coefficient: (", lower_valence, ", ", upper_valence, ")\n")
## 95% Confidence Interval for Valence Coefficient: ( 0.4578082 , 0.9066163 )
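The lower and upper bounds differ slightly between the two methods, because confint() profiles the likelihood while the manual calculation uses the normal (Wald) approximation based on the standard error.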
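As a cross-check on the manual calculation, here is a small sketch on simulated data (the data frame sim and the model fit are hypothetical stand-ins, not the Spotify file). It assumes only base R and shows that estimate ± z × standard error reproduces what confint.default() returns, using qnorm(0.975) in place of the rounded 1.96.

```r
set.seed(42)

# Simulated stand-in data (hypothetical; not the Spotify file)
n <- 500
sim <- data.frame(valence = runif(n))
sim$mode <- rbinom(n, size = 1, prob = plogis(0.5 + 0.7 * sim$valence))

fit <- glm(mode ~ valence, data = sim, family = binomial)

# Manual Wald interval: estimate +/- z * standard error
est <- coef(fit)[["valence"]]
se  <- sqrt(diag(vcov(fit)))[["valence"]]
z   <- qnorm(0.975)  # about 1.96
wald_manual <- c(est - z * se, est + z * se)

# confint.default() builds the same Wald interval from coef() and vcov()
wald_builtin <- confint.default(fit, parm = "valence")

print(wald_manual)
print(wald_builtin)
```

On the real model, confint.default(model1, parm = "valence") should therefore match the hand-computed interval up to the rounding of 1.96, while confint(model1, parm = "valence") gives the profile-likelihood interval.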
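# Linear fit of danceability on energy for major songs (mode == 1);
# a separate name keeps the logistic model1 from being overwritten
fit_energy <- lm(danceability ~ energy,
                 data = filter(data1, mode == 1))
rsquared <- summary(fit_energy)$r.squared
data1 |>
  filter(mode == 1) |>
  ggplot(mapping = aes(x = energy,
                       y = danceability)) +
  geom_point() +
  # dashed red line: linear fit; default smoother added for comparison
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Energy vs danceability",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()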
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Linear fit of danceability on loudness for major songs (mode == 1)
fit_loudness <- lm(danceability ~ loudness,
                   data = filter(data1, mode == 1))
rsquared <- summary(fit_loudness)$r.squared
data1 |>
  filter(mode == 1) |>
  ggplot(mapping = aes(x = loudness,
                       y = danceability)) +
  geom_point() +
  # dashed red line: linear fit; default smoother added for comparison
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Loudness vs danceability",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
In both graphs above the relationship looks approximately linear, so I see no need to transform these variables.