library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(conflicted)
# Read the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
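Because `sample_n()` draws rows at random, each run can fit the model to a different 9,000-row sample, and the estimates can shift slightly between runs. The following is a minimal sketch of a repeatable draw. The seed value and the toy data frame `toy` are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for the real data set (hypothetical values)
toy <- data.frame(
  explicit = rep(c("True", "False"), each = 50),
  mode     = rep(c(0, 1), times = 50)
)

set.seed(123)  # arbitrary seed, chosen only for illustration

# slice_sample() is the current dplyr replacement for sample_n()
toy_sample <- toy |>
  dplyr::filter(explicit == "True") |>
  slice_sample(n = 20)

nrow(toy_sample)
```

Calling `set.seed()` immediately before the sampling step is what makes the draw repeatable; the same idea would apply to the 9,000-row sample used here.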
# Identify binary columns
binary_columns <- data |> select(where(~ n_distinct(.) == 2)) |> names()
# Print binary columns
print(binary_columns)
[1] "mode"
# Display unique values in the mode column
unique_values <- unique(data$mode)
# Print the unique values
print(unique_values)
[1] 1 0
# Print values in mode column
print(data$mode)
[1] 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 0 0 0 1 1 1
[62] 1 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 1 1 1 0 1
[123] 1 1 0 1 0 0 1 0 0 0 0 1 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 1
[184] 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0
[245] 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 1 0 0 0 1
[306] 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 1 1 1 0 1 0 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1
[367] 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 1
[428] 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 0 0 1 1 1 0
[489] 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 1 0 1 1
[550] 0 0 1 1 1 0 0 1 1 1 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0
[611] 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 0 0 1 1 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0
[672] 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0 0 1 1 0 0 0 1 1 1
[733] 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 0 1 1 1 0 0 1 1
[794] 0 0 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 0
[855] 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 1
[916] 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 0 0 1 0
[977] 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0 0 0
[ reached getOption("max.print") -- omitted 8000 entries ]
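Printing the whole column runs into the `max.print` limit and is hard to read. A frequency table is usually a more compact way to inspect a binary column. The sketch below uses a short made-up vector, `mode_values`, in place of `data$mode`.

```r
# Toy vector standing in for data$mode (hypothetical values)
mode_values <- c(1, 0, 0, 1, 1, 1, 0, 1)

counts <- table(mode_values)   # frequency of each level
props  <- prop.table(counts)   # share of each level

counts
round(props, 2)
```

On the real data, `table(data$mode)` would show how the sampled songs split between minor and major.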
The binary column in the sampled data is “mode”, with values 0 and 1. A binary outcome is what logistic regression requires: it lets us model the probability that a song is in a major mode (mode = 1) as a function of musical features such as danceability, and interpret how each feature shifts that probability.
Insight: The mode column is binary (0 = Minor, 1 = Major), meaning we can use logistic regression to model it.
Significance: Understanding mode helps analyze whether certain features influence a song’s emotional tone.
Further Questions: Do other categorical variables (like key or explicit content) also show strong patterns with mode?
We use danceability, energy, and tempo as explanatory variables for mode.
# Fit the logistic regression model
logit_model <- glm(mode ~ danceability + energy + tempo, data = data, family = binomial)
# Display model summary
summary(logit_model)
Call:
glm(formula = mode ~ danceability + energy + tempo, family = binomial,
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6614037 0.1746617 3.787 0.000153 ***
danceability -0.9446418 0.1384768 -6.822 9e-12 ***
energy 0.2603858 0.1256781 2.072 0.038280 *
tempo 0.0005437 0.0007138 0.762 0.446266
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 12262 on 8999 degrees of freedom
Residual deviance: 12183 on 8996 degrees of freedom
AIC: 12191
Number of Fisher Scoring iterations: 4
# Display model coefficients
coef(summary(logit_model))
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6614037170 0.1746617358 3.7867694 1.526186e-04
danceability -0.9446418025 0.1384767966 -6.8216613 8.999365e-12
energy 0.2603857685 0.1256781342 2.0718462 3.827978e-02
tempo 0.0005437059 0.0007138487 0.7616543 4.462664e-01
# To interpret them as odds ratios, exponentiate them
exp(coef(logit_model))
(Intercept) danceability energy tempo
1.9375101 0.3888188 1.2974305 1.0005439
Insight: Danceability has a negative coefficient, so more danceable songs are less likely to be in a major mode. Energy has a positive coefficient, so higher-energy songs are more likely to be major. Tempo has no statistically significant effect.
Significance: In this sample, dance-friendly songs lean toward minor modes while high-energy songs lean toward major, a pattern that could be relevant to music production and recommendation algorithms.
Further Questions: Would adding features like valence (happiness) or key improve the model’s accuracy?
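One way to judge whether the three predictors jointly improve on an intercept-only model is a likelihood-ratio test built from the null and residual deviances. The sketch below uses the rounded values printed by `summary(logit_model)`, so the statistic is approximate.

```r
# Deviances and degrees of freedom as printed by summary(logit_model)
null_dev  <- 12262; null_df  <- 8999
resid_dev <- 12183; resid_df <- 8996

lr_stat <- null_dev - resid_dev   # likelihood-ratio statistic
lr_df   <- null_df - resid_df     # number of predictors added

# Upper-tail chi-squared probability for the drop in deviance
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df, p-value:",
    format.pval(p_value), "\n")
```

The drop in deviance is statistically significant, but it is under 1% of the null deviance, so the predictors explain only a small part of the variation in mode.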
Intercept (0.6614, p = 0.00015) Interpretation: When all predictors (danceability, energy, tempo) are zero, the log-odds of mode = 1 is 0.6614. Exponentiation: exp(0.6614) ≈ 1.938, meaning that the baseline odds of mode = 1 versus mode = 0 are about 1.94 to 1 when all predictors are zero.
Danceability (-0.9446, p < 0.0001) Interpretation: A 1-unit increase in danceability decreases the log-odds of mode = 1 by 0.9446, holding energy and tempo fixed. Exponentiation: exp(-0.9446) ≈ 0.389, meaning that a 1-unit increase in danceability decreases the odds of mode = 1 by ~61.1%. Conclusion: Danceability has a strong negative association with mode.
Energy (0.2604, p = 0.038) Interpretation: A 1-unit increase in energy increases the log-odds of mode = 1 by 0.2604, holding danceability and tempo fixed. Exponentiation: exp(0.2604) ≈ 1.297, meaning that a 1-unit increase in energy increases the odds of mode = 1 by ~29.7%. Conclusion: Energy has a modest positive association with mode, significant at the 5% level.
Tempo (0.0005, p = 0.446) Interpretation: A 1-unit increase in tempo increases the log-odds of mode = 1 by 0.0005. Exponentiation: exp(0.0005) ≈ 1.0005, meaning that tempo has almost no effect on mode. Conclusion: Since p = 0.45 (greater than 0.05), tempo is not statistically significant.
Mode (target variable): 0 = Minor (sad), 1 = Major (happy).
Danceability (-0.94, significant): More danceable songs are less likely to be in a major mode.
Energy (+0.26, significant): Higher energy makes a song more likely to be in a major mode.
Tempo (+0.0005, not significant): There is no evidence that song speed is associated with major or minor mode.
Key insight: High-energy songs tend to be major, while danceable songs tend to be minor.
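The step from a coefficient to a percent change in the odds can be sketched as follows, using the estimates printed by `coef(summary(logit_model))`. The second calculation rescales the effect to a 0.1-unit step, on the assumption that danceability and energy are scored from 0 to 1; the source does not state the range, so treat that part as illustrative.

```r
# Estimates copied from the coef(summary(logit_model)) output
est <- c(danceability = -0.9446418025,
         energy       =  0.2603857685,
         tempo        =  0.0005437059)

# Percent change in the odds of mode = 1 per 1-unit increase
pct_change_unit <- (exp(est) - 1) * 100

# Percent change per 0.1-unit increase, for the two features
# assumed (not confirmed by the source) to lie on a 0-1 scale
scaled <- est[c("danceability", "energy")]
pct_change_tenth <- (exp(0.1 * scaled) - 1) * 100

round(pct_change_unit, 2)
round(pct_change_tenth, 2)
```

If the 0 to 1 assumption holds, a 1-unit increase spans the feature's entire range, so the per-0.1 figures may be the more practical way to describe the effect.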
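# Values entered by hand (they differ slightly from the summary() output above,
# presumably because sample_n() drew a different random sample for that run)
beta_hat <- -0.8736 # Coefficient for danceability
se <- 0.1386 # Standard Error for danceability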
# Calculate 95% Confidence Interval in log-odds scale
lower_bound_log <- beta_hat - (1.96 * se)
upper_bound_log <- beta_hat + (1.96 * se)
# Convert everything to the odds ratio scale (e^log-odds)
lower_bound_or <- exp(lower_bound_log)
upper_bound_or <- exp(upper_bound_log)
beta_hat_or <- exp(beta_hat)
# Print the result (Odds ratio scale)
cat("Beta (odds ratio) for danceability: ", beta_hat_or, "\n")
Beta (odds ratio) for danceability: 0.417446
cat("95% CI for danceability (Odds ratio scale): [", lower_bound_or, ",", upper_bound_or, "]\n")
95% CI for danceability (Odds ratio scale): [ 0.3181425 , 0.5477458 ]
beta_hat_or = 0.417: For every one-unit increase in danceability, the odds of the outcome (mode = 1, a major-mode song) are multiplied by 0.417. In other words, the odds fall by approximately 58.3% (since 1 - 0.417 = 0.583).
The 95% CI for the odds ratio runs from 0.318 to 0.548, so we can be 95% confident that the true odds ratio for danceability lies within this range. Since the range does not include 1, the effect of danceability on mode is statistically significant at the 5% level.
Exponentiating both beta_hat and the interval endpoints puts the coefficient and its confidence interval on the same scale (odds ratios). This is easier to interpret than the log-odds scale because it describes the effect of a predictor as a multiplicative change in the odds.
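Typing the coefficient and standard error by hand risks them drifting out of sync with the fitted model. A minimal sketch of pulling both from the model object is shown below. It uses simulated stand-in data (`sim` and its true coefficients are hypothetical), so the numbers will not match the song data.

```r
set.seed(1)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in data (hypothetical; not the song data set)
n <- 500
sim <- data.frame(danceability = runif(n))
sim$mode <- rbinom(n, size = 1, prob = plogis(0.6 - 0.9 * sim$danceability))

fit <- glm(mode ~ danceability, data = sim, family = binomial)

# Pull the estimate and standard error from the fitted model
coefs    <- coef(summary(fit))
beta_hat <- coefs["danceability", "Estimate"]
se       <- coefs["danceability", "Std. Error"]

# Wald 95% interval on the log-odds scale, then exponentiate
z <- qnorm(0.975)
wald_or <- exp(c(estimate = beta_hat,
                 lower    = beta_hat - z * se,
                 upper    = beta_hat + z * se))

# confint.default() computes the same Wald interval directly
builtin_or <- exp(confint.default(fit)["danceability", ])

wald_or
builtin_or
```

Applied to `logit_model`, the same lines would keep the interval tied to whichever sample the model was actually fitted on.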
Further Questions: Does the effect of danceability on the odds of a major mode vary across genres or decades? What happens to the odds ratio for danceability when controlling for other factors (e.g., artist popularity, release date, or song length)? How does danceability compare to other musical features (e.g., tempo, loudness, valence) in predicting mode?