library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(ggplot2)
library(conflicted)
library(pwr) # For power analysis
Warning: package ‘pwr’ was built under R version 4.4.3
library(boot) # For bootstrapping
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
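Because sample_n() draws rows at random, the summaries that follow will change from run to run unless the random seed is fixed. A minimal base-R sketch of a reproducible draw is shown here; the toy_tracks data frame, the draw_sample() helper, and the seed value 123 are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for the real dataset; column names mirror the ones used above.
toy_tracks <- data.frame(
  explicit   = rep(c("True", "False"), each = 50),
  popularity = rep(0:49, times = 2)
)

# Keep only explicit tracks, as in the filtering step above.
explicit_only <- toy_tracks[toy_tracks$explicit == "True", ]

# Hypothetical helper: fix the seed, then sample n rows without replacement.
draw_sample <- function(df, n, seed) {
  set.seed(seed)
  df[sample(nrow(df), size = n), ]
}

sample_a <- draw_sample(explicit_only, n = 20, seed = 123)
sample_b <- draw_sample(explicit_only, n = 20, seed = 123)

# Same seed, same rows.
identical(sample_a, sample_b)
```

In the dplyr pipeline, calling set.seed() immediately before sample_n() should have the same effect, since sample_n() relies on R's random number generator.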
Select a continuous (or ordered integer) column of data that seems
most “valuable” given the context of your data, and call this your
response variable.
# Summary statistics for popularity
summary(data$popularity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    20.0    38.0    36.5    56.0    98.0
# Check unique values to confirm it's numeric/ordered
unique(data$popularity)
[1] 50 49 78 15 56 28 34 17 74 71 70 76 68 60 1 29 51 21 52 46 0 24 48 27 45 63 25 69 66 37 55 53 67 65 40 33 22 61 54 43 16
[42] 20 23 30 79 8 58 59 47 26 39 12 41 62 18 36 72 64 38 57 44 32 35 19 6 31 42 77 82 83 2 94 75 9 4 81 87 10 13 80 96 3
[83] 84 73 91 7 14 89 85 5 86 11 97 93 90 88 98 92
Popularity scores range from 0 to 98. The mean (36.5) sits slightly
below the median (38.0), so these two statistics do not by themselves
indicate right skew; the gap is consistent with a cluster of low scores
pulling the mean down. The first quartile (Q1 = 20.0) and third
quartile (Q3 = 56.0) show that the middle 50% of tracks have popularity
scores between 20 and 56.
The unique(data$popularity) output shows that popularity takes integer
values spread across nearly the whole 0–98 range, so we treat it as an
ordered integer (approximately continuous) variable rather than a
categorical one.
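To see how a pile of zero scores affects the mean, median, and skewness, a small sketch on made-up numbers can help. The scores vector and the skewness() helper are assumptions for illustration; they are not taken from the dataset.

```r
# Made-up scores: 20 tracks at 0 plus an evenly spread bulk from 30 to 70.
scores <- c(rep(0, 20), 30:70)

# Moment-based skewness: negative values mean the low end pulls harder,
# positive values mean a longer right tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

mean(scores)        # pulled down by the zeros
median(scores)      # less affected by the zeros
mean(scores == 0)   # share of tracks sitting exactly at 0
skewness(scores)
```

Running the same three summaries on data$popularity would show how much of the gap between mean and median comes from the tracks at 0.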
Visual - 1
# Histogram of Popularity
ggplot(data, aes(x = popularity)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Track Popularity", x = "Popularity", y = "Count")

From the graph:
1. Many tracks have a popularity of 0, indicating little or no
engagement, possibly due to niche artists or newly added songs.
2. Apart from the spike at 0, most songs have moderate popularity
(10-50), and the right tail thins out so that only a few reach high
popularity.
3. The distribution looks bimodal, with peaks at 0 and at 25-30
popularity, suggesting two distinct groups of songs.
4. Counts decline gradually beyond 50, showing that mainstream success
is rare and highly competitive.
Checking how popularity correlates with other numerical
features.
# Select numeric columns only
numeric_data <- data %>% select(where(is.numeric))
# Compute correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
# Show correlation of popularity with other features
cor_matrix["popularity", ]
               X       popularity      duration_ms     danceability           energy              key         loudness
     0.069695956      1.000000000     -0.024994263      0.043700311     -0.138849719     -0.033808285     -0.038545986
            mode      speechiness     acousticness instrumentalness         liveness          valence            tempo
    -0.020201275     -0.117183177      0.030176371     -0.084582897     -0.121186883      0.001573949      0.024953068
  time_signature
     0.042263188
Popularity has no strong linear correlation with any numerical
feature. The largest associations are negative and weak: energy
(-0.139), liveness (-0.121), and speechiness (-0.117). Danceability
(0.044) and time_signature (0.042) show very weak positive
correlations.
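Reading a long row of coefficients by eye is error-prone, so it can help to sort them by absolute size. The sketch below copies the values from the output above into a named vector (the names cors and ranked are introduced here for illustration) and ranks them.

```r
# Correlations with popularity, copied from the output above
# (the popularity-with-itself entry is dropped).
cors <- c(
  X = 0.069695956, duration_ms = -0.024994263, danceability = 0.043700311,
  energy = -0.138849719, key = -0.033808285, loudness = -0.038545986,
  mode = -0.020201275, speechiness = -0.117183177, acousticness = 0.030176371,
  instrumentalness = -0.084582897, liveness = -0.121186883,
  valence = 0.001573949, tempo = 0.024953068, time_signature = 0.042263188
)

# Order by absolute size so the strongest associations come first,
# regardless of sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(head(ranked, 5), 3)
```

With the live objects, the same ordering could be applied to cor_matrix["popularity", ] after dropping the "popularity" entry.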
Visual - 2
ggplot(data, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.5, color = "red") +
  theme_minimal() +
  labs(title = "Danceability vs Popularity", x = "Danceability", y = "Popularity")

The scatter plot shows at most a weak positive association between
danceability and popularity (r ≈ 0.044): tracks with higher
danceability are only marginally more popular on average.
The points are widely dispersed, suggesting popularity is
influenced by multiple factors beyond danceability.
Select a categorical column of data (explanatory variable) that you
expect might influence the response variable.
We need a categorical variable that might influence popularity. A
good choice is “key” (musical key of the track), since different musical
keys might affect listener preferences.
# Check unique values in the key column
unique(data$key)
[1] 8 5 0 7 11 4 9 3 1 10 6 2
# Count occurrences of each key to ensure we have enough data
table(data$key)
   0    1    2    3    4    5    6    7    8    9   10   11
 770 1283  810  288  669  550  776  973  660  763  597  861
# Twelve keys make for many groups; use the mode column instead
# (mode == 1 is labelled Major, everything else Minor)
data$key_grouped <- ifelse(data$mode == 1, "Major", "Minor")
# Check the new grouping
table(data$key_grouped)
Major Minor
 5210  3790
Musical key is a plausible categorical variable, but with 12 distinct
keys the comparison has many groups. To simplify, we use the mode
column instead, which flags each track as Major (mode = 1) or Minor.
Note that mode is a separate column, not a regrouping of the 12 keys,
so key_grouped really records the track's mode. The sample has
5,210 Major tracks and 3,790 Minor tracks, which is enough data in
each group for a comparison. Comparing popularity across these two
groups can show whether listeners favor one mode over the other.
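The difference between the key and mode columns is easier to see in a cross-tabulation. The sketch below uses a few made-up rows (toy and mode_label are names introduced here for illustration) with the same coding as the ifelse() call above, where mode == 1 is labelled Major.

```r
# Made-up rows for illustration: key is a pitch class (0-11) and mode is a
# 0/1 flag, matching the coding used in the ifelse() call above.
toy <- data.frame(
  key  = c(0, 0, 5, 5, 9, 9),
  mode = c(1, 0, 1, 0, 1, 0)
)

# Recode mode as a labelled factor: 1 = Major, 0 = Minor.
toy$mode_label <- factor(toy$mode, levels = c(0, 1), labels = c("Minor", "Major"))

# Each key can appear under both modes, so the Major/Minor split is a
# separate variable rather than a grouping of the 12 keys.
key_by_mode <- table(key = toy$key, mode = toy$mode_label)
key_by_mode
```

Running table(data$key, data$key_grouped) on the real sample would show the same thing at full scale.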
Devise a null hypothesis for an ANOVA test given this situation.
Test this hypothesis using ANOVA, and summarize your results (e.g., use
box plots). Be clear about how the R output relates to your
conclusions.
For the ANOVA test, we compare mean popularity between the Major and
Minor groups (key_grouped).
# Run ANOVA test
anova_result <- aov(popularity ~ key_grouped, data = data)
# Print ANOVA summary
summary(anova_result)
              Df  Sum Sq Mean Sq F value Pr(>F)
key_grouped    1    2167    2168   3.674 0.0553 .
Residuals   8998 5309050     590
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Null Hypothesis (H₀): The mean popularity of tracks
is the same for Major and Minor keys.
- Alternative Hypothesis (H₁): The mean popularity
differs between Major and Minor keys.
The ANOVA test output shows an F-value of
3.674 with a p-value of 0.0553. Since
p > 0.05, we fail to reject H₀, meaning
there is no strong evidence that key type (Major vs. Minor)
influences popularity significantly. However, since the p-value
is close to 0.05, there might be a weak trend worth further
exploration.
Interpretation of ANOVA Results:
The ANOVA test evaluates whether mean popularity differs between
Major and Minor keys. The p-value of 0.0553 (the Pr(>F) entry in the
key_grouped row) is above the conventional 5% significance level but
below 10%, indicating weak evidence of a difference.
Conclusion:
Since the p-value is greater than 0.05, there is
no strong statistical evidence that key type (Major or
Minor) is related to a track’s popularity. The result would count as
significant at a 10% level, so a small difference cannot be ruled out,
but it is not conclusive. Key type alone does not appear to be a major
factor in a song’s popularity. Other elements, such as genre,
lyrics, tempo, or artist recognition, may have a stronger
influence on listener preferences.
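The p-value says how surprising the difference is, not how large it is. As a rough check on size, the sums of squares in the ANOVA table can be turned into eta-squared, the share of total variation attributable to the grouping. The sketch below uses only numbers copied from the table; the variable names are introduced here for illustration.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_between <- 2167
ss_within  <- 5309050
df_between <- 1
df_within  <- 8998

# Eta-squared: share of total variation in popularity tied to the grouping.
eta_sq <- ss_between / (ss_between + ss_within)

# Rebuild the F statistic and its p-value from the table entries. Small
# differences from the printed values come from rounding in the table.
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(eta_sq = eta_sq, F = f_value, p = p_value), 4)
```

An eta-squared this close to zero means the Major/Minor split accounts for a negligible share of the variation in popularity, whichever side of 0.05 the p-value falls on.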
# Box plot of Popularity vs Key Type
ggplot(data, aes(x = key_grouped, y = popularity, fill = key_grouped)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Popularity by Key Type", x = "Key Type", y = "Popularity") +
  scale_fill_manual(values = c("Major" = "blue", "Minor" = "red"))

The box plot visualizes the distribution of
popularity scores for songs in Major (blue) and
Minor (red) keys. The median popularity is nearly the same for
both groups, with a similar interquartile range and spread, indicating
that key type (Major or Minor) does not strongly influence
popularity.
Find a single continuous (or ordered integer, non-binary) column of
data that might influence the response variable. Make sure the
relationship between this variable and the response is roughly
linear.
# Check structure of danceability
str(data$danceability)
num [1:9000] 0.907 0.382 0.756 0.223 0.737 0.787 0.259 0.309 0.889 0.49 ...
# Summary statistics
summary(data$danceability)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0647  0.5230  0.6580  0.6369  0.7720  0.9710
# Check correlation with popularity
cor(data$popularity, data$danceability, use="complete.obs")
[1] 0.04370031
Interpretation of Analysis on Danceability and
Popularity
- Danceability is a continuous variable ranging from
0.0647 to 0.9710, with a median of
0.658.
- The correlation between danceability and popularity
is 0.0437, indicating a very weak positive
relationship.
- Since the correlation is close to zero,
danceability does not strongly influence popularity,
meaning higher danceability does not necessarily lead to more popular
songs.
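A confidence interval shows how precisely the correlation is estimated. The sketch below applies the standard Fisher z-transform to the reported coefficient, assuming all 9,000 sampled rows entered the calculation (no missing values); the variable names are introduced here for illustration.

```r
r <- 0.04370031   # correlation reported above
n <- 9000         # assumed number of complete observations

# Fisher z-transform: atanh(r) is approximately normal with
# standard error 1 / sqrt(n - 3).
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the correlation scale.
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(ci, 3)
```

On the real data, cor.test(data$popularity, data$danceability) reports an interval of this kind directly.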
Step 2: Check for Linearity
Before fitting a linear regression model, we visualize the
relationship.
# Scatter plot with trend line
ggplot(data, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", col = "blue") +
  theme_minimal() +
  labs(title = "Danceability vs. Popularity",
       x = "Danceability",
       y = "Popularity")

Interpretation of the Scatter Plot: Danceability
vs. Popularity
- Each dot represents a song, plotting its
danceability (x-axis) against
popularity (y-axis).
- The blue line represents the linear
trend fitted using a regression model.
- The line is nearly flat, indicating a very
weak positive correlation, which aligns with the calculated
correlation of 0.0437.
- The spread of points shows no clear linear pattern,
suggesting danceability is not a strong predictor of
popularity.
# Fit the model
lm_model <- lm(popularity ~ danceability, data = data)
# Display model summary
summary(lm_model)
Call:
lm(formula = popularity ~ danceability, data = data)
Residuals:
   Min     1Q Median     3Q    Max
-38.46 -16.05   0.78  19.26  61.97
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   32.5739     0.9796  33.253  < 2e-16 ***
danceability   6.1601     1.4846   4.149 3.37e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24.27 on 8998 degrees of freedom
Multiple R-squared: 0.00191, Adjusted R-squared: 0.001799
F-statistic: 17.22 on 1 and 8998 DF, p-value: 3.366e-05
The linear model shows that danceability has a small but
statistically significant association with popularity (p < 0.001).
However, the R² value (0.00191) is extremely low, meaning danceability
explains only 0.19% of the variation in popularity. The
coefficient (6.16) means that a 1-unit increase in danceability is
associated with a 6.16-point increase in popularity on average.
Because danceability only ranges from 0.0647 to 0.9710 in this sample,
a 1-unit change spans more than the whole observed range; a 0.1
increase corresponds to about 0.62 points. The residual standard error
(24.27) is large relative to that slope, indicating substantial
unexplained variability. This suggests that danceability alone is
not a strong predictor, and other factors should be considered for
better modeling.
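To put the slope on a scale that matches the data, the fitted line can be evaluated at the lowest and highest observed danceability. The sketch below uses only the coefficients and range reported above; the variable names are introduced here for illustration.

```r
# Coefficients and observed range copied from the output above.
intercept <- 32.5739
slope     <- 6.1601
dance_min <- 0.0647
dance_max <- 0.9710

# Expected change in popularity for a 0.1 increase in danceability.
slope_per_tenth <- slope * 0.1

# Fitted popularity at the lowest and highest observed danceability.
fitted_range <- intercept + slope * c(dance_min, dance_max)

slope_per_tenth
round(fitted_range, 2)
diff(fitted_range)   # total fitted swing across the observed range
```

The fitted swing across the whole observed range is small next to the residual standard error of 24.27, which is another way of seeing why R² is so low.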
Residual Analysis
# Plot residuals to check model assumptions
par(mfrow = c(2,2))
plot(lm_model)

The residual analysis plots assess the assumptions of linear
regression:
- Residuals vs Fitted: The residuals scatter around zero, but some
clustering suggests possible non-linearity or heteroscedasticity.
- Q-Q Plot: Residuals deviate from the normal
line, suggesting non-normal errors.
- Scale-Location: The spread is roughly even, but a slight upward
trend suggests possible heteroscedasticity.
- Residuals vs Leverage: No extreme leverage points,
though a few observations may influence the model.
Overall, the plots point to some violations of linear regression
assumptions, so a more complex model may be needed. A
multiple regression model with more features (e.g., tempo, energy,
loudness, speechiness) could help explain more of the variation in
popularity.
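Visual checks for heteroscedasticity can be backed up with a numeric one. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R, on simulated data whose noise grows with the predictor. The data, the seed, and all variable names are assumptions for illustration; nothing here is taken from the dataset.

```r
# Simulated data: the noise grows with x, so the constant-variance
# assumption is violated by construction.
set.seed(42)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.2 + 2 * x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor.
aux <- lm(residuals(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with one degree of freedom per
# predictor.
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

Applied to lm_model, the same steps would use residuals(lm_model), data$danceability as the predictor, and n = 9000. A small p-value would support the reading of the Scale-Location plot.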
