A Statistical Analysis of Multi Functioning Rap Gods

Author

Kimberly N. Reiner

Dataset: Spotify and Genius Lyrics

1. Consider the lyrics data

library(haven)
library(tidyverse)
library(fastDummies)
library(nnet)
library(dplyr)

Look at the average values for each rap artist. Variables: name (song title), artist, popularity (song), danceability, loudness, energy, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms, swears, totalWords, uniqueWords, averageLength, flow_wpm, vulgarityRate.

Top10_Analysis <- read.csv("~/Desktop/Top10.csv")

Top10_Analysis <- as_tibble(Top10_Analysis) %>%
  dplyr::select(name, artist, popularity, danceability, energy, speechiness,
                acousticness, instrumentalness, loudness, liveness, valence,
                tempo, duration_ms, swears, totalWords, uniqueWords,
                averageLength, flow_wpm, vulgarityRate)

# Grouping the data by artist
Top10_Analysis_by_artist <- Top10_Analysis %>%
  group_by(artist)

# Summary statistics for each artist
summary_by_artist <- summarise(Top10_Analysis_by_artist,
                                mean_popularity = mean(popularity, na.rm = TRUE),
                                mean_danceability = mean(danceability, na.rm = TRUE),
                                mean_energy = mean(energy, na.rm = TRUE),
                                mean_speechiness = mean(speechiness, na.rm = TRUE),
                                mean_acousticness = mean(acousticness, na.rm = TRUE),
                                mean_instrumentalness = mean(instrumentalness, na.rm = TRUE),
                                mean_loudness = mean(loudness, na.rm = TRUE),
                                mean_liveness = mean(liveness, na.rm = TRUE),
                                mean_valence = mean(valence, na.rm = TRUE),
                                mean_tempo = mean(tempo, na.rm = TRUE),
                                mean_duration_ms = mean(duration_ms, na.rm = TRUE),
                                mean_swears = mean(swears, na.rm = TRUE),
                                mean_totalWords = mean(totalWords, na.rm = TRUE),
                                mean_uniqueWords = mean(uniqueWords, na.rm = TRUE),
                                mean_averageLength = mean(averageLength, na.rm = TRUE),
                                mean_flow_wpm = mean(flow_wpm, na.rm = TRUE),
                                mean_vulgarityRate = mean(vulgarityRate, na.rm = TRUE))
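
The per-artist means above can also be written more compactly with `across()`. The following is a minimal sketch, not the code used for the results in this report. The `toy` data frame and its values are hypothetical stand-ins for Top10.csv.

```r
library(dplyr)

# Hypothetical toy data standing in for Top10.csv (values are made up)
toy <- tibble(
  artist = c("A", "A", "B", "B"),
  name = c("s1", "s2", "s3", "s4"),
  popularity = c(40, 60, 20, NA),
  tempo = c(90, 110, 100, 120)
)

# One mean per numeric column, named mean_<column>, for each artist
toy_summary <- toy %>%
  group_by(artist) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE),
                   .names = "mean_{.col}"))

print(toy_summary)
```

With this form, a column added to the data later gets its mean computed without another line in `summarise()`.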





# Splitting columns into three groups to make it easier to create graphics.
n_cols <- ncol(summary_by_artist)
n_cols_per_table <- (n_cols - 1) %/% 3  # Calculate approximately equal number of columns for each table

# Create three separate tables with "artist" column repeated to keep track of this label.
table1 <- summary_by_artist[, c( 1, 1, 2:(n_cols_per_table + 1))]
table2 <- summary_by_artist[, c( 1, 1, (n_cols_per_table + 2):(2 * n_cols_per_table + 1))]
table3 <- summary_by_artist[, c(1, 1, (2 * n_cols_per_table + 2):n_cols)]

# Rename the repeated "artist" column in each table
colnames(table1)[2] <- "artist"
colnames(table2)[2] <- "artist"
colnames(table3)[2] <- "artist"

# Print or use these tables as needed
print(table1)

# A tibble: 10 × 7
   artist  artist mean_popularity mean_danceability mean_energy mean_speechiness
   <chr>   <chr>            <dbl>             <dbl>       <dbl>            <dbl>
 1 2Pac    2Pac              44.3             0.789       0.735            0.235
 2 Drake   Drake             57.8             0.648       0.557            0.220
 3 Eminem  Eminem            58.1             0.740       0.770            0.257
 4 JAY-Z   JAY-Z             39.9             0.680       0.767            0.303
 5 Kendri… Kendr…            50.6             0.626       0.644            0.275
 6 Lil Wa… Lil W…            34.7             0.685       0.698            0.263
 7 Nas     Nas               38.2             0.650       0.753            0.294
 8 Nicki … Nicki…            48.9             0.716       0.683            0.203
 9 Snoop … Snoop…            25.6             0.688       0.714            0.248
10 The No… The N…            22.9             0.711       0.731            0.312
# ℹ 1 more variable: mean_acousticness <dbl>

print(table2)

# A tibble: 10 × 7
   artist  artist mean_instrumentalness mean_loudness mean_liveness mean_valence
   <chr>   <chr>                  <dbl>         <dbl>         <dbl>        <dbl>
 1 2Pac    2Pac                0.00893          -6.80         0.199        0.636
 2 Drake   Drake               0.00814          -7.92         0.195        0.353
 3 Eminem  Eminem              0.000122         -4.51         0.266        0.540
 4 JAY-Z   JAY-Z               0.0116           -5.59         0.233        0.577
 5 Kendri… Kendr…              0.00237          -8.00         0.248        0.470
 6 Lil Wa… Lil W…              0.000587         -6.15         0.241        0.561
 7 Nas     Nas                 0.000193         -5.91         0.226        0.554
 8 Nicki … Nicki…              0.00860          -6.13         0.219        0.461
 9 Snoop … Snoop…              0.00298          -5.97         0.250        0.571
10 The No… The N…              0.00208          -6.32         0.212        0.598
# ℹ 1 more variable: mean_tempo <dbl>

print(table3)

# A tibble: 10 × 9
   artist   artist mean_duration_ms mean_swears mean_totalWords mean_uniqueWords
   <chr>    <chr>             <dbl>       <dbl>           <dbl>            <dbl>
 1 2Pac     2Pac            262728.        30.1            777.             292.
 2 Drake    Drake           240018.        10.2            536.             210.
 3 Eminem   Eminem          267958.        15.8            852.             344.
 4 JAY-Z    JAY-Z           246930.        15.0            648.             264.
 5 Kendric… Kendr…          261510.        13.8            681.             268.
 6 Lil Way… Lil W…          228370.        21.8            629.             237.
 7 Nas      Nas             220992.        11.2            608.             281.
 8 Nicki M… Nicki…          219241.        13.9            540.             194.
 9 Snoop D… Snoop…          226943.        20.1            547.             207.
10 The Not… The N…          253524.        24.3            674.             284.
# ℹ 3 more variables: mean_averageLength <dbl>, mean_flow_wpm <dbl>,
#   mean_vulgarityRate <dbl>

# Display summary statistics for each artist
#print(summary_by_artist)

Exploring correlations among the variables, with a focus on popularity.

library(dplyr)
library(ggplot2)

# Calculate correlation coefficients
correlation_matrix <- cor(Top10_Analysis[, c("popularity", "danceability", "energy", "speechiness", "acousticness", "instrumentalness", "loudness", "liveness", "valence", "tempo", "duration_ms", "swears", "totalWords", "uniqueWords", "averageLength", "flow_wpm", "vulgarityRate")])

# Convert correlation matrix to tidy format for visualization
correlation_data <- as.data.frame(as.table(correlation_matrix))
colnames(correlation_data) <- c("Variable1", "Variable2", "Correlation")

# Keep only the correlations with popularity (the metric of interest),
# dropping popularity's correlation with itself
popularity_pairs <- correlation_data %>%
  filter(Variable1 == "popularity", Variable2 != "popularity")

# Plot correlation coefficients, one bar per variable
ggplot(popularity_pairs, aes(x = Variable2, y = Correlation)) +
  geom_col(color = "black") +
  theme_minimal() +
  labs(x = "Variable", y = "Correlation with popularity") +
  ggtitle("Correlation of Variables with popularity")

# Extract correlations with popularity
popularity_correlation <- correlation_matrix["popularity", ]

# Print correlation coefficient for popularity
print(popularity_correlation)

      popularity     danceability           energy      speechiness 
    1.0000000000     0.0213578565    -0.1043837420    -0.1014664999 
    acousticness instrumentalness         loudness         liveness 
    0.0071366723     0.0083631240    -0.0000438763    -0.0508413913 
         valence            tempo      duration_ms           swears 
   -0.1879453939     0.0742749720     0.1134398258    -0.0872883322 
      totalWords      uniqueWords    averageLength         flow_wpm 
    0.1202335440     0.1173026217     0.0159271752    -0.0081929663 
   vulgarityRate 
   -0.1380476354

# Extract correlation coefficients with "popularity"
popularity_correlation <- correlation_matrix["popularity", ]

# Remove "popularity" itself (its self-correlation is always 1)
popularity_correlation <- popularity_correlation[names(popularity_correlation) != "popularity"]

# Sort correlation coefficients in descending order
top_positive_correlations <- sort(popularity_correlation, decreasing = TRUE)

# Select the top 5 positive coefficients
top_5_positive <- head(top_positive_correlations, 5)

# Print the top 5 positive coefficients
print(top_5_positive)

  totalWords  uniqueWords  duration_ms        tempo danceability 
  0.12023354   0.11730262   0.11343983   0.07427497   0.02135786
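
Sorting in descending order surfaces only the positive correlations, yet in the printed output above the largest correlations in magnitude are negative (valence and vulgarityRate). A minimal sketch of ranking by absolute value follows. The vector is copied from that printed output, and `top_5_strongest` is a name introduced here for illustration.

```r
# Correlations with popularity, copied from the printed output above
popularity_correlation <- c(
  danceability = 0.0213578565, energy = -0.1043837420,
  speechiness = -0.1014664999, acousticness = 0.0071366723,
  instrumentalness = 0.0083631240, loudness = -0.0000438763,
  liveness = -0.0508413913, valence = -0.1879453939,
  tempo = 0.0742749720, duration_ms = 0.1134398258,
  swears = -0.0872883322, totalWords = 0.1202335440,
  uniqueWords = 0.1173026217, averageLength = 0.0159271752,
  flow_wpm = -0.0081929663, vulgarityRate = -0.1380476354
)

# Order by absolute value so strong negative correlations are not overlooked
strongest <- popularity_correlation[order(abs(popularity_correlation),
                                          decreasing = TRUE)]
top_5_strongest <- head(strongest, 5)

print(top_5_strongest)
```

Even the strongest of these is below 0.2 in magnitude, so every linear association with popularity in this data is weak.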

Top10_Analysis<-read.csv("~/Desktop/Top10.csv")

Top10_Analysis <- as_tibble(Top10_Analysis) %>%
  dplyr::select(name, artist, popularity, uniqueWords, swears, energy, totalWords, flow_wpm, averageLength, duration_ms, tempo, danceability)


#Make a model with respect to popularity
m1 <- lm(popularity ~ totalWords + uniqueWords + duration_ms + tempo + danceability, data = Top10_Analysis)


summary(m1)


Call:
lm(formula = popularity ~ totalWords + uniqueWords + duration_ms + 
    tempo + danceability, data = Top10_Analysis)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.195 -12.724  -1.517  11.961  53.767 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.268e+01  3.056e+00   7.421 1.65e-13 ***
totalWords   4.934e-04  3.731e-03   0.132   0.8948    
uniqueWords  1.628e-02  8.185e-03   1.989   0.0468 *  
duration_ms  2.142e-05  8.563e-06   2.502   0.0124 *  
tempo        4.850e-02  1.207e-02   4.018 6.06e-05 ***
danceability 3.594e+00  2.816e+00   1.276   0.2020    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.85 on 2233 degrees of freedom
Multiple R-squared:  0.02476,   Adjusted R-squared:  0.02257 
F-statistic: 11.34 on 5 and 2233 DF,  p-value: 8.078e-11

hist(Top10_Analysis$popularity)

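# Diagnostic plots in the style of SAS output:
# residuals vs fitted, residual histogram, and normal Q-Q plot.
almost_sas <- function(aov.results){
  aov_residuals <- residuals(aov.results)
  old_par <- par(mfrow = c(2, 2))  # 2 x 2 plotting grid
  on.exit(par(old_par))            # restore graphics settings on exit
  plot(aov.results, which = 1)     # residuals vs fitted
  hist(aov_residuals)              # histogram of residuals
  plot(aov.results, which = 2)     # normal Q-Q plot
}
almost_sas(m1)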

# Fit the linear regression model
m1 <- lm(popularity ~ totalWords + uniqueWords + duration_ms + tempo + danceability , data = Top10_Analysis)

# Residual Analysis
# 1. Linearity Check
plot(m1, which = 1)

# 2. Independence of Errors Check
# Example: Durbin-Watson test
library(car)
durbinWatsonTest(m1)

 lag Autocorrelation D-W Statistic p-value
   1       0.4055432      1.188721       0
 Alternative hypothesis: rho != 0
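
To make the Durbin-Watson output easier to read, here is a minimal sketch of how the statistic is formed from residuals and how it relates to the lag-1 autocorrelation. The residual vector `e` is hypothetical. Only the value 0.4055432 comes from the output above. The relation DW ≈ 2(1 - r) is a standard approximation, not the exact computation performed by `durbinWatsonTest()`.

```r
# Hypothetical residuals, in observation order (values are made up)
e <- c(1.2, 0.8, 0.5, -0.3, -0.9, -0.4, 0.2, 0.7)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals (assumes mean close to 0)
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

# Approximation DW ~ 2 * (1 - r), applied to the reported autocorrelation
dw_from_report <- 2 * (1 - 0.4055432)

print(c(dw = dw, r1 = r1, dw_from_report = dw_from_report))
```

Values near 2 indicate no autocorrelation. The reported statistic of about 1.19 is well below 2, which points to positive autocorrelation between residuals of adjacent rows.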

# 3. Homoscedasticity Check
# Plotting residuals against fitted values
plot(m1, which = 3)

# 4. Normality of Residuals Check
# Example: QQ plot
qqnorm(resid(m1))
qqline(resid(m1))

# Example: Shapiro-Wilk test
shapiro.test(resid(m1))


    Shapiro-Wilk normality test

data:  resid(m1)
W = 0.99195, p-value = 8.429e-10

# 5. Multicollinearity Check
# Example: Variance Inflation Factor (VIF)
library(car)
vif(m1)

  totalWords  uniqueWords  duration_ms        tempo danceability 
    4.536690     3.347790     1.918308     1.025490     1.094735
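
As a sketch of what `vif()` reports: for each predictor, the variance inflation factor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The example below uses the built-in `mtcars` data purely as a stand-in, since the song data are not bundled here. The variable choices are illustrative assumptions.

```r
# Built-in mtcars data used purely as a stand-in for the song data
X <- mtcars[, c("wt", "hp", "disp")]

# VIF for wt: regress wt on the other predictors, then 1 / (1 - R^2)
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# Same quantity for all predictors at once:
# the diagonal of the inverse of the predictor correlation matrix
vif_all <- diag(solve(cor(X)))

print(vif_wt)
print(vif_all)
```

A VIF of 4.54 for totalWords therefore corresponds to an R² of about 0.78 when totalWords is regressed on the other four predictors, which is consistent with its overlap with uniqueWords.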

# 6. Outliers and Influential Points Check
# Example: Cook's distance
plot(m1, which = 4)

# Example: Identify influential points
influencePlot(m1)

       StudRes         Hat        CookD
155  2.7561756 0.003449488 0.0043695501
398  3.0209612 0.002204150 0.0033478115
803 -1.4503270 0.028990982 0.0104617834
864  0.1767099 0.034290579 0.0001848788
869 -1.3976496 0.031896184 0.0107220232

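A linear model (lm) is a reasonable choice here because popularity is continuous and the residuals are approximately normal: the Q-Q plot of the residuals is close to linear and the residual histogram is roughly bell-shaped. The Shapiro-Wilk test does reject exact normality (W = 0.992, p < 0.001), but with 2,239 observations it flags even small departures. All VIFs are below 5, so multicollinearity is not a serious concern, and the largest Cook's distance among the flagged points is about 0.011. The Durbin-Watson statistic (1.19, p ≈ 0) indicates positive autocorrelation in the residuals, so the independence assumption should be treated with caution.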

hist(Top10_Analysis$popularity)

coefficients(m1)

 (Intercept)   totalWords  uniqueWords  duration_ms        tempo danceability 
2.267975e+01 4.934232e-04 1.627919e-02 2.142283e-05 4.850023e-02 3.593523e+00

Model:

\[\hat{y} = 22.67975 + 0.0004934232\,\text{totalWords} + 0.01627919\,\text{uniqueWords} + 0.00002142283\,\text{duration_ms} + 0.04850023\,\text{tempo} + 3.593523\,\text{danceability}\]
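
As a worked example of the fitted equation, the sketch below plugs one song into the model. The coefficients are copied from `coefficients(m1)` above. The `song` values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from coefficients(m1) above
b <- c(`(Intercept)` = 22.67975, totalWords = 4.934232e-04,
       uniqueWords = 1.627919e-02, duration_ms = 2.142283e-05,
       tempo = 4.850023e-02, danceability = 3.593523)

# Hypothetical song (values chosen only for illustration)
song <- c(totalWords = 600, uniqueWords = 250, duration_ms = 240000,
          tempo = 100, danceability = 0.7)

# Predicted popularity = intercept + sum of (slope * predictor value)
predicted <- b[["(Intercept)"]] + sum(b[names(song)] * song)

print(predicted)
```

For this hypothetical song the predicted popularity is about 39.6. With the fitted model object available, `predict(m1, newdata = ...)` gives the same kind of result directly.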

Check the relevant assumptions of the model. Are we okay to proceed with statistical inference?

almost_sas(m1)

Yes, we can continue with statistical inference. A linear model (lm) remains a reasonable choice because the response is continuous, the Q-Q plot of the residuals is approximately linear, and the residual histogram looks normal apart from a slight skew, which is not enough to abandon the normal-errors model.

1b-iv. Which, if any, are significant predictors of popularity? Test at the \(\alpha=0.05\) level.

summary(m1)


Call:
lm(formula = popularity ~ totalWords + uniqueWords + duration_ms + 
    tempo + danceability, data = Top10_Analysis)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.195 -12.724  -1.517  11.961  53.767 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.268e+01  3.056e+00   7.421 1.65e-13 ***
totalWords   4.934e-04  3.731e-03   0.132   0.8948    
uniqueWords  1.628e-02  8.185e-03   1.989   0.0468 *  
duration_ms  2.142e-05  8.563e-06   2.502   0.0124 *  
tempo        4.850e-02  1.207e-02   4.018 6.06e-05 ***
danceability 3.594e+00  2.816e+00   1.276   0.2020    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.85 on 2233 degrees of freedom
Multiple R-squared:  0.02476,   Adjusted R-squared:  0.02257 
F-statistic: 11.34 on 5 and 2233 DF,  p-value: 8.078e-11

Model:

\[\hat{y} = 22.67975 + 0.0004934232\,\text{totalWords} + 0.01627919\,\text{uniqueWords} + 0.00002142283\,\text{duration_ms} + 0.04850023\,\text{tempo} + 3.593523\,\text{danceability}\]

Hypotheses:

\[H_0:\beta_\text{totalWords}=\beta_\text{uniqueWords}=\beta_\text{duration_ms}=\beta_\text{tempo}=\beta_\text{danceability}=0\]

\[H_1:\text{At least one }\beta_i\ne 0\]

Test statistic and p-value (overall F-test):

\[F_0 = 11.34 \quad (p < 0.001)\]

Conclusion/Interpretation: Reject \(H_0\). There is sufficient evidence to suggest that at least one slope is non-zero.
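
The overall F statistic can be recovered from the R-squared and the degrees of freedom reported by `summary(m1)`. This is a minimal sketch using the rounded values from that printout, so the result matches only up to rounding. The variable names are introduced here for illustration.

```r
# Values copied from the summary(m1) output
r2 <- 0.02476        # Multiple R-squared
k <- 5               # number of predictors
df_resid <- 2233     # residual degrees of freedom

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability under the F(k, df_resid) distribution
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The test is highly significant, but the R-squared of about 0.025 shows that the five predictors together explain only about 2.5% of the variation in popularity.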

Hypotheses:

\[H_0:\beta_\text{totalWords}=0\]

\[H_1:\beta_\text{totalWords}\neq 0\]

Test statistic and p-value:

\[t_0 = 0.132 \quad (p = 0.8948)\]
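
The t statistic and p-value for a single slope follow from its estimate and standard error. The sketch below reproduces the totalWords row using the rounded values printed by `summary(m1)`. The variable names are introduced here for illustration.

```r
# Values copied from the totalWords row of summary(m1)
estimate <- 4.934e-04
std_error <- 3.731e-03
df_resid <- 2233

# t statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with df_resid degrees of freedom
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)

print(c(t = t_stat, p = p_value))
```

The resulting p-value of about 0.895 is far above \(\alpha=0.05\).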