library(ggplot2)
Keep up to date with changes at https://tidyverse.org/blog/
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
✔ readr 2.1.5 ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
Build a linear (or generalized linear) model as you like.Use
whatever response variable and explanatory variables you prefer.
model <- lm(popularity ~ danceability + energy + acousticness + valence + tempo + loudness + duration_ms, data = data)
summary(model)
Call:
lm(formula = popularity ~ danceability + energy + acousticness +
valence + tempo + loudness + duration_ms, data = data)
Residuals:
Min 1Q Median 3Q Max
-47.20 -15.57 0.13 18.71 65.83
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.990e+01 3.069e+00 19.521 < 2e-16 ***
danceability -5.341e+00 1.863e+00 -2.867 0.004158 **
energy -2.727e+01 2.067e+00 -13.189 < 2e-16 ***
acousticness -6.563e-01 1.144e+00 -0.574 0.566073
valence 7.031e-01 1.243e+00 0.566 0.571564
tempo 3.037e-02 8.651e-03 3.511 0.000449 ***
loudness 6.289e-01 1.144e-01 5.498 3.96e-08 ***
duration_ms -1.228e-06 3.014e-06 -0.407 0.683693
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24 on 8992 degrees of freedom
Multiple R-squared: 0.02459, Adjusted R-squared: 0.02383
F-statistic: 32.38 on 7 and 8992 DF, p-value: < 2.2e-16
Use the tools from previous weeks to diagnose the model.
1.Residual Histogram
# Assuming your linear model is called 'lm_model'
# Calculate residuals
residuals <- residuals(lm_model)
# Plot histogram of residuals
hist(residuals, main="Histogram of Residuals", xlab="Residuals", col="lightblue", border="black")

The histogram displays the distribution of residuals from the model,
showing their frequency across different ranges. It suggests that the
residuals are approximately symmetric but may not perfectly follow a
normal distribution.
2. Residuals vs Fitted Values Plot
# Plot residuals vs fitted values
plot(fitted(lm_model), residuals, main="Residuals vs Fitted Values", xlab="Fitted Values", ylab="Residuals", pch=20)
abline(h=0, col="red", lwd=2) # Add a horizontal line at 0

The plot shows residuals versus fitted values from a linear
regression model. Residuals are scattered randomly around the red line
at zero, indicating no obvious pattern and suggesting the model fits the
data reasonably well.
3. QQ-Plot
# Generate QQ-Plot
qqnorm(residuals)
qqline(residuals, col="red", lwd=2)

The Normal Q-Q plot compares the residuals of a model to a
theoretical normal distribution. Most points align with the red diagonal
line, indicating approximate normality, but deviations at the tails
suggest potential outliers or non-normal behavior. This may impact the
model’s assumptions and require further diagnostics.
4. Cook’s Distance by Observation
# Calculate Cook's Distance
cooks_d <- cooks.distance(lm_model)
# Plot Cook's Distance by observation
plot(cooks_d, type="h", main="Cook's Distance by Observation", xlab="Observation", ylab="Cook's Distance", col="blue")
abline(h=4 / length(cooks_d), col="red", lwd=2) # Threshold line

The graph represents Cook’s Distance for each observation in a
dataset, which measures the influence of individual observations on the
regression model.Most observations have very low Cook’s Distance (below
the red threshold line), indicating they have minimal impact on the
model’s coefficients. A few spikes above the red threshold line suggest
influential observations that could disproportionately affect the
model’s results.
Highlight any issues with the model.
Key Model Issues Identified:
Non-Normal Residuals: Q-Q plot deviations at the tails and histogram
asymmetry suggest residuals may violate normality assumptions.
Potential Heteroscedasticity: Slight funnel-shaped pattern in
residuals vs. fitted implies non-constant variance, risking biased error
estimates.
Influential Outliers: High Cook’s Distance values and outlier
residuals indicate data points disproportionately affecting model
accuracy.
Interpret at least one of the coefficients.
Let’s interpret the “danceability” coefficient from the model:
Coefficient for Danceability: -4.742e+00 (or -4.742)
Interpretation: For every one-unit increase in danceability, the
dependent variable (e.g., popularity) is expected to decrease by 4.742
units, holding all other variables constant.
Significance: The p-value for danceability is 0.010899, which is less
than 0.05, indicating that the relationship between danceability and the
dependent variable is statistically significant at the 5% level.
Context: A negative coefficient means that as the danceability of a
track increases, the outcome variable decreases, suggesting that higher
danceability may negatively influence the target variable(popularity) in
this model.
LS0tDQp0aXRsZTogIkRhdGEgRGl2ZSAtIDExIg0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KDQpgYGB7cn0NCmxpYnJhcnkoZ2dwbG90MikNCmxpYnJhcnkoZHBseXIpDQpsaWJyYXJ5KHRpZHl2ZXJzZSkNCmBgYA0KDQoNCmBgYHtyfQ0KI1JlYWRpbmcgdGhlIGRhdGEgc2V0DQpkYXRhIDwtIHJlYWQuY3N2KCJkYXRhc2V0LmNzdiIpDQpjb25mbGljdGVkOjpjb25mbGljdHNfcHJlZmVyKGRwbHlyOjpmaWx0ZXIpDQojIEZpbHRlcmluZyBkYXRhc2V0IHdoZXJlIGV4cGxpY2l0IGlzICJUcnVlIiBhbmQgdGFraW5nIGEgc2FtcGxlIG9mIDksMDAwIHJvd3MNCnNhbXBsZV9kYXRhIDwtIGRhdGEgfD4gZmlsdGVyKGV4cGxpY2l0ID09ICJUcnVlIikgfD4gc2FtcGxlX24oOTAwMCkNCmRhdGEgPC0gc2FtcGxlX2RhdGENCmRhdGENCmBgYA0KDQojIEJ1aWxkIGEgbGluZWFyIChvciBnZW5lcmFsaXplZCBsaW5lYXIpIG1vZGVsIGFzIHlvdSBsaWtlLlVzZSB3aGF0ZXZlciByZXNwb25zZSB2YXJpYWJsZSBhbmQgZXhwbGFuYXRvcnkgdmFyaWFibGVzIHlvdSBwcmVmZXIuDQoNCmBgYHtyfQ0KbW9kZWwgPC0gbG0ocG9wdWxhcml0eSB+IGRhbmNlYWJpbGl0eSArIGVuZXJneSArIGFjb3VzdGljbmVzcyArIHZhbGVuY2UgKyB0ZW1wbyArIGxvdWRuZXNzICsgZHVyYXRpb25fbXMsIGRhdGEgPSBkYXRhKQ0Kc3VtbWFyeShtb2RlbCkNCmBgYA0KDQojIFVzZSB0aGUgdG9vbHMgZnJvbSBwcmV2aW91cyB3ZWVrcyB0byBkaWFnbm9zZSB0aGUgbW9kZWwuDQojIDEuUmVzaWR1YWwgSGlzdG9ncmFtDQoNCmBgYHtyfQ0KIyBBc3N1bWluZyB5b3VyIGxpbmVhciBtb2RlbCBpcyBjYWxsZWQgJ2xtX21vZGVsJw0KIyBDYWxjdWxhdGUgcmVzaWR1YWxzDQpyZXNpZHVhbHMgPC0gcmVzaWR1YWxzKGxtX21vZGVsKQ0KDQojIFBsb3QgaGlzdG9ncmFtIG9mIHJlc2lkdWFscw0KaGlzdChyZXNpZHVhbHMsIG1haW49Ikhpc3RvZ3JhbSBvZiBSZXNpZHVhbHMiLCB4bGFiPSJSZXNpZHVhbHMiLCBjb2w9ImxpZ2h0Ymx1ZSIsIGJvcmRlcj0iYmxhY2siKQ0KYGBgDQoNClRoZSBoaXN0b2dyYW0gZGlzcGxheXMgdGhlIGRpc3RyaWJ1dGlvbiBvZiByZXNpZHVhbHMgZnJvbSB0aGUgbW9kZWwsIHNob3dpbmcgdGhlaXIgZnJlcXVlbmN5IGFjcm9zcyBkaWZmZXJlbnQgcmFuZ2VzLiBJdCBzdWdnZXN0cyB0aGF0IHRoZSByZXNpZHVhbHMgYXJlIGFwcHJveGltYXRlbHkgc3ltbWV0cmljIGJ1dCBtYXkgbm90IHBlcmZlY3RseSBmb2xsb3cgYSBub3JtYWwgZGlzdHJpYnV0aW9uLg0KDQojIDIuIFJlc2lkdWFscyB2cyBGaXR0ZWQgVmFsdWVzIFBsb3QNCmBgYHtyfQ0KIyBQbG90IHJlc2lkdWFscyB2cyBmaXR0ZWQgdmFsdWVzDQpwbG90KGZpdHRlZChsbV9tb2RlbCksIHJlc2lkdWFscywgbWFpbj0iUmVzaWR1YWxzIHZzIEZpdHRlZCBWYWx1ZXMiLCB4bGFiPSJGaXR0ZWQgVmFsdWVzIiwgeWxhYj0iUmVzaWR1YWxzIiwgcGNoPTIwKQ0KYWJsaW5lKGg9MCwgY29sPSJyZWQiLCBsd2Q9MikgICMgQWRkIGEgaG9yaXpvbnRhbCBsaW5lIGF0IDANCmBgYA0KDQpUaGUgcGxvdCBzaG93cyByZXNpZHVhbHMgdmVyc3VzIGZpdHRlZCB2YWx1ZXMgZnJvbSBhIGxpbmVhciByZWdyZXNzaW9uIG1vZGVsLiBSZXNpZHVhbHMgYXJlIHNjYXR0ZXJlZCByYW5kb21seSBhcm91bmQgdGhlIHJlZCBsaW5lIGF0IHplcm8sIGluZGljYXRpbmcgbm8gb2J2aW91cyBwYXR0ZXJuIGFuZCBzdWdnZXN0aW5nIHRoZSBtb2RlbCBmaXRzIHRoZSBkYXRhIHJlYXNvbmFibHkgd2VsbC4NCg0KIyAzLiBRUS1QbG90DQoNCmBgYHtyfQ0KIyBHZW5lcmF0ZSBRUS1QbG90DQpxcW5vcm0ocmVzaWR1YWxzKQ0KcXFsaW5lKHJlc2lkdWFscywgY29sPSJyZWQiLCBsd2Q9MikNCmBgYA0KDQpUaGUgTm9ybWFsIFEtUSBwbG90IGNvbXBhcmVzIHRoZSByZXNpZHVhbHMgb2YgYSBtb2RlbCB0byBhIHRoZW9yZXRpY2FsIG5vcm1hbCBkaXN0cmlidXRpb24uIE1vc3QgcG9pbnRzIGFsaWduIHdpdGggdGhlIHJlZCBkaWFnb25hbCBsaW5lLCBpbmRpY2F0aW5nIGFwcHJveGltYXRlIG5vcm1hbGl0eSwgYnV0IGRldmlhdGlvbnMgYXQgdGhlIHRhaWxzIHN1Z2dlc3QgcG90ZW50aWFsIG91dGxpZXJzIG9yIG5vbi1ub3JtYWwgYmVoYXZpb3IuIFRoaXMgbWF5IGltcGFjdCB0aGUgbW9kZWwncyBhc3N1bXB0aW9ucyBhbmQgcmVxdWlyZSBmdXJ0aGVyIGRpYWdub3N0aWNzLg0KDQojIDQuIENvb2sncyBEaXN0YW5jZSBieSBPYnNlcnZhdGlvbg0KDQpgYGB7cn0NCiMgQ2FsY3VsYXRlIENvb2sncyBEaXN0YW5jZQ0KY29va3NfZCA8LSBjb29rcy5kaXN0YW5jZShsbV9tb2RlbCkNCg0KIyBQbG90IENvb2sncyBEaXN0YW5jZSBieSBvYnNlcnZhdGlvbg0KcGxvdChjb29rc19kLCB0eXBlPSJoIiwgbWFpbj0iQ29vaydzIERpc3RhbmNlIGJ5IE9ic2VydmF0aW9uIiwgeGxhYj0iT2JzZXJ2YXRpb24iLCB5bGFiPSJDb29rJ3MgRGlzdGFuY2UiLCBjb2w9ImJsdWUiKQ0KYWJsaW5lKGg9NCAvIGxlbmd0aChjb29rc19kKSwgY29sPSJyZWQiLCBsd2Q9MikgICMgVGhyZXNob2xkIGxpbmUNCmBgYA0KDQpUaGUgZ3JhcGggcmVwcmVzZW50cyBDb29rJ3MgRGlzdGFuY2UgZm9yIGVhY2ggb2JzZXJ2YXRpb24gaW4gYSBkYXRhc2V0LCB3aGljaCBtZWFzdXJlcyB0aGUgaW5mbHVlbmNlIG9mIGluZGl2aWR1YWwgb2JzZXJ2YXRpb25zIG9uIHRoZSByZWdyZXNzaW9uIG1vZGVsLk1vc3Qgb2JzZXJ2YXRpb25zIGhhdmUgdmVyeSBsb3cgQ29vaydzIERpc3RhbmNlIChiZWxvdyB0aGUgcmVkIHRocmVzaG9sZCBsaW5lKSwgaW5kaWNhdGluZyB0aGV5IGhhdmUgbWluaW1hbCBpbXBhY3Qgb24gdGhlIG1vZGVsJ3MgY29lZmZpY2llbnRzLiBBIGZldyBzcGlrZXMgYWJvdmUgdGhlIHJlZCB0aHJlc2hvbGQgbGluZSBzdWdnZXN0IGluZmx1ZW50aWFsIG9ic2VydmF0aW9ucyB0aGF0IGNvdWxkIGRpc3Byb3BvcnRpb25hdGVseSBhZmZlY3QgdGhlIG1vZGVsJ3MgcmVzdWx0cy4NCg0KDQojIEhpZ2hsaWdodCBhbnkgaXNzdWVzIHdpdGggdGhlIG1vZGVsLg0KDQojIEtleSBNb2RlbCBJc3N1ZXMgSWRlbnRpZmllZDoNCk5vbi1Ob3JtYWwgUmVzaWR1YWxzOiBRLVEgcGxvdCBkZXZpYXRpb25zIGF0IHRoZSB0YWlscyBhbmQgaGlzdG9ncmFtIGFzeW1tZXRyeSBzdWdnZXN0IHJlc2lkdWFscyBtYXkgdmlvbGF0ZSBub3JtYWxpdHkgYXNzdW1wdGlvbnMuDQoNClBvdGVudGlhbCBIZXRlcm9zY2VkYXN0aWNpdHk6IFNsaWdodCBmdW5uZWwtc2hhcGVkIHBhdHRlcm4gaW4gcmVzaWR1YWxzIHZzLiBmaXR0ZWQgaW1wbGllcyBub24tY29uc3RhbnQgdmFyaWFuY2UsIHJpc2tpbmcgYmlhc2VkIGVycm9yIGVzdGltYXRlcy4NCg0KSW5mbHVlbnRpYWwgT3V0bGllcnM6IEhpZ2ggQ29vaydzIERpc3RhbmNlIHZhbHVlcyBhbmQgb3V0bGllciByZXNpZHVhbHMgaW5kaWNhdGUgZGF0YSBwb2ludHMgZGlzcHJvcG9ydGlvbmF0ZWx5IGFmZmVjdGluZyBtb2RlbCBhY2N1cmFjeS4NCg0KDQojIEludGVycHJldCBhdCBsZWFzdCBvbmUgb2YgdGhlIGNvZWZmaWNpZW50cy4NCg0KTGV0J3MgaW50ZXJwcmV0IHRoZSAiZGFuY2VhYmlsaXR5IiBjb2VmZmljaWVudCBmcm9tIHRoZSBtb2RlbDoNCg0KQ29lZmZpY2llbnQgZm9yIERhbmNlYWJpbGl0eTogLTQuNzQyZSswMCAob3IgLTQuNzQyKQ0KDQpJbnRlcnByZXRhdGlvbjogRm9yIGV2ZXJ5IG9uZS11bml0IGluY3JlYXNlIGluIGRhbmNlYWJpbGl0eSwgdGhlIGRlcGVuZGVudCB2YXJpYWJsZSAoZS5nLiwgcG9wdWxhcml0eSkgaXMgZXhwZWN0ZWQgdG8gZGVjcmVhc2UgYnkgNC43NDIgdW5pdHMsIGhvbGRpbmcgYWxsIG90aGVyIHZhcmlhYmxlcyBjb25zdGFudC4NCg0KU2lnbmlmaWNhbmNlOiBUaGUgcC12YWx1ZSBmb3IgZGFuY2VhYmlsaXR5IGlzIDAuMDEwODk5LCB3aGljaCBpcyBsZXNzIHRoYW4gMC4wNSwgaW5kaWNhdGluZyB0aGF0IHRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiBkYW5jZWFiaWxpdHkgYW5kIHRoZSBkZXBlbmRlbnQgdmFyaWFibGUgaXMgc3RhdGlzdGljYWxseSBzaWduaWZpY2FudCBhdCB0aGUgNSUgbGV2ZWwuDQoNCkNvbnRleHQ6IEEgbmVnYXRpdmUgY29lZmZpY2llbnQgbWVhbnMgdGhhdCBhcyB0aGUgZGFuY2VhYmlsaXR5IG9mIGEgdHJhY2sgaW5jcmVhc2VzLCB0aGUgb3V0Y29tZSB2YXJpYWJsZSBkZWNyZWFzZXMsIHN1Z2dlc3RpbmcgdGhhdCBoaWdoZXIgZGFuY2VhYmlsaXR5IG1heSBuZWdhdGl2ZWx5IGluZmx1ZW5jZSB0aGUgdGFyZ2V0IHZhcmlhYmxlKHBvcHVsYXJpdHkpIGluIHRoaXMgbW9kZWwuDQoNCg0KDQoNCg0KDQoNCg0K