library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(conflicted)
library(pwr)   # For power analysis
Warning: package ‘pwr’ was built under R version 4.4.3
library(boot)  # For bootstrapping
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.

# Summary statistics for popularity
summary(data$popularity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    20.0    38.0    36.5    56.0    98.0 
# Check unique values to confirm it's numeric/ordered
unique(data$popularity)
 [1] 50 49 78 15 56 28 34 17 74 71 70 76 68 60  1 29 51 21 52 46  0 24 48 27 45 63 25 69 66 37 55 53 67 65 40 33 22 61 54 43 16
[42] 20 23 30 79  8 58 59 47 26 39 12 41 62 18 36 72 64 38 57 44 32 35 19  6 31 42 77 82 83  2 94 75  9  4 81 87 10 13 80 96  3
[83] 84 73 91  7 14 89 85  5 86 11 97 93 90 88 98 92

The minimum popularity score is 0, and the maximum is 98, indicating a wide range. The mean (36.5) sits slightly below the median (38.0); on their own these two statistics do not indicate right skew, and the gap is consistent with a cluster of very low scores pulling the mean down. The first quartile (Q1 = 20.0) and third quartile (Q3 = 56.0) mark the middle 50% of tracks, which have popularity scores between 20 and 56.

The unique(data$popularity) function confirms that popularity is an ordered numeric variable. The presence of values across the full range (0–98) suggests that popularity is not categorical but a continuous (or ordinal integer) variable.

Visual - 1

# Histogram of Popularity
ggplot(data, aes(x = popularity)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Track Popularity", x = "Popularity", y = "Count")

From the graph:

1. Many tracks have 0 popularity, indicating low or no engagement, possibly due to niche artists or newly added songs.
2. The distribution is right-skewed, with most songs having moderate popularity (10-50) and only a few reaching high popularity.
3. A bimodal trend is observed, with peaks at 0 and 25-30 popularity, suggesting two distinct groups of songs.
4. Popularity declines gradually beyond 50, showing that mainstream success is rare and highly competitive.

Checking how popularity correlates with other numerical features.

# Select numeric columns only
numeric_data <- data %>% select_if(is.numeric)

# Compute correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")

# Show correlation of popularity with other features
cor_matrix["popularity", ]
               X       popularity      duration_ms     danceability           energy              key         loudness 
     0.069695956      1.000000000     -0.024994263      0.043700311     -0.138849719     -0.033808285     -0.038545986 
            mode      speechiness     acousticness instrumentalness         liveness          valence            tempo 
    -0.020201275     -0.117183177      0.030176371     -0.084582897     -0.121186883      0.001573949      0.024953068 
  time_signature 
     0.042263188 

Popularity has no strong correlation with any numerical feature. Energy (-0.139), liveness (-0.121), and speechiness (-0.117) show slight negative associations, while danceability (0.044) and time_signature (0.042) show very weak positive ones.

Visual - 2

ggplot(data, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.5, color = "red") +
  theme_minimal() +
  labs(title = "Danceability vs Popularity", x = "Danceability", y = "Popularity")

The scatter plot shows a weak positive correlation between danceability and popularity, meaning tracks with higher danceability tend to be more popular but not strongly. The points are widely dispersed, suggesting popularity is influenced by multiple factors beyond danceability.

Select a categorical column of data (explanatory variable) that you expect might influence the response variable.

We need a categorical variable that might influence popularity. A good choice is “key” (musical key of the track), since different musical keys might affect listener preferences.


# Check unique values in the key column
unique(data$key)
 [1]  8  5  0  7 11  4  9  3  1 10  6  2
# Count occurrences of each key to ensure we have enough data
table(data$key)

   0    1    2    3    4    5    6    7    8    9   10   11 
 770 1283  810  288  669  550  776  973  660  763  597  861 
# Group tracks into Major and Minor using the mode column (1 = Major)
data$key_grouped <- ifelse(data$mode == 1, "Major", "Minor")

# Check the new grouping
table(data$key_grouped)

Major Minor 
 5210  3790 

Musical key is a reasonable categorical candidate, but with 12 distinct keys the comparison is simplified to two groups, Major and Minor. Note that this grouping comes from the mode column (mode == 1 is labeled Major, anything else Minor), so key_grouped reflects a track's modality rather than a regrouping of the 12 key values. The dataset has 5,210 Major tracks and 3,790 Minor tracks, ensuring enough data for comparison. Analyzing popularity across these groups can reveal whether listeners prefer one mode over the other.

Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results (e.g., use box plots). Be clear about how the R output relates to your conclusions.

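For the ANOVA test, we compare mean popularity between the Major and Minor groups. The null hypothesis (H₀) is that mean popularity is the same in both groups; the alternative (H₁) is that it differs.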

# Run ANOVA test
anova_result <- aov(popularity ~ key_grouped, data = data)

# Print ANOVA summary
summary(anova_result)
              Df  Sum Sq Mean Sq F value Pr(>F)  
key_grouped    1    2167    2168   3.674 0.0553 .
Residuals   8998 5309050     590                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The ANOVA test output shows an F-value of 3.674 with a p-value of 0.0553. Since p > 0.05, we fail to reject H₀, meaning there is no strong evidence that key type (Major vs. Minor) influences popularity significantly. However, since the p-value is close to 0.05, there might be a weak trend worth further exploration.

Interpretation of ANOVA Results:

The ANOVA test evaluates whether mean popularity differs between Major and Minor keys. The p-value of 0.0553 is above the conventional 5% significance level but below 10%, indicating weak evidence of an effect.

Conclusion:

Since the p-value is greater than 0.05, there is no strong statistical evidence that key type (Major or Minor) influences a track’s popularity. However, at a 10% significance level, the results suggest a slight trend, though not conclusive. Key type alone does not appear to be a major factor in determining a song’s popularity. Other elements, such as genre, lyrics, tempo, or artist recognition, likely have a stronger influence on listener preferences.

# Box plot of Popularity vs Key Type
ggplot(data, aes(x = key_grouped, y = popularity, fill = key_grouped)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Popularity by Key Type", x = "Key Type", y = "Popularity") +
  scale_fill_manual(values = c("Major" = "blue", "Minor" = "red"))

The box plot visualizes the distribution of popularity scores for songs in Major (blue) and Minor (red) keys. The median popularity is nearly the same for both groups, with a similar interquartile range and spread, indicating that key type (Major or Minor) does not strongly influence popularity.

Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.

# Check structure of danceability
str(data$danceability)
 num [1:9000] 0.907 0.382 0.756 0.223 0.737 0.787 0.259 0.309 0.889 0.49 ...
# Summary statistics
summary(data$danceability)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0647  0.5230  0.6580  0.6369  0.7720  0.9710 
# Check correlation with popularity
cor(data$popularity, data$danceability, use="complete.obs")
[1] 0.04370031

Interpretation of Analysis on Danceability and Popularity

  • Danceability is a continuous variable ranging from 0.0647 to 0.9710, with a median of 0.658.
  • The correlation between danceability and popularity is 0.0437, indicating a very weak positive relationship.
  • Since the correlation is close to zero, danceability does not strongly influence popularity, meaning higher danceability does not necessarily lead to more popular songs.

Step 2: Check for Linearity

Before fitting a linear regression model, we visualize the relationship.


# Scatter plot with trend line
ggplot(data, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", col = "blue") +
  theme_minimal() +
  labs(title = "Danceability vs. Popularity",
       x = "Danceability",
       y = "Popularity")

Interpretation of the Scatter Plot: Danceability vs. Popularity

  • Each dot represents a song, plotting its danceability (x-axis) against popularity (y-axis).
  • The blue line represents the linear trend fitted using a regression model.
  • The line is nearly flat, indicating a very weak positive correlation, which aligns with the calculated correlation of 0.0437.
  • The spread of points shows no clear linear pattern, suggesting danceability is not a strong predictor of popularity.
# Fit the model
lm_model <- lm(popularity ~ danceability, data = data)

# Display model summary
summary(lm_model)

Call:
lm(formula = popularity ~ danceability, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-38.46 -16.05   0.78  19.26  61.97 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   32.5739     0.9796  33.253  < 2e-16 ***
danceability   6.1601     1.4846   4.149 3.37e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.27 on 8998 degrees of freedom
Multiple R-squared:  0.00191,   Adjusted R-squared:  0.001799 
F-statistic: 17.22 on 1 and 8998 DF,  p-value: 3.366e-05

The linear model shows a small but statistically significant association between danceability and popularity (p < 0.001). However, the R² value (0.00191) is extremely low, meaning danceability explains only 0.19% of the variation in popularity. The coefficient (6.16) is the expected change in popularity per 1-unit increase in danceability; because danceability only ranges from about 0.06 to 0.97 here, a more practical reading is roughly 0.62 points per 0.1 increase. The high residual standard error (24.27) indicates large unexplained variability, so danceability alone is not a strong predictor and other factors should be considered for better modeling.

Residual Analysis

# Plot residuals to check model assumptions
par(mfrow = c(2,2))
plot(lm_model)

The residual analysis plots assess the assumptions of linear regression:

  1. Residuals vs Fitted: Residuals are centered around zero overall, but some clustering indicates potential non-linearity or heteroscedasticity.
  2. Q-Q Plot: Residuals deviate from the normal line, suggesting non-normality in errors.
  3. Scale-Location: The spread looks broadly even, but a slight upward trend suggests possible heteroscedasticity.
  4. Residuals vs Leverage: No extreme leverage points, but a few observations may influence the model.

Overall, the plots indicate violations of linear regression assumptions, so a single-predictor linear model is likely inadequate. A multiple regression model with more features (e.g., tempo, energy, loudness, speechiness) would be a natural next step for understanding popularity.

---
title: "Data Dive - 8"
output: html_notebook
---
```{r}
library(dplyr)
library(ggplot2)
library(readr)
library(tidyverse)
library(conflicted)
library(pwr)   # For power analysis
library(boot)  # For bootstrapping
```

```{r}
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
```
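
Because `sample_n()` draws a fresh random sample on every run, the statistics quoted in the write-up can drift between renders unless a seed is fixed first. The sketch below illustrates the idea on a made-up data frame; `toy` and the seed value 123 are assumptions for illustration and are not part of the original analysis.

```r
# Toy stand-in for the track data; the values are made up for illustration
toy <- data.frame(id = 1:100, popularity = rep(0:9, times = 10))

# Fixing the seed immediately before sampling makes the draw repeatable
set.seed(123)  # arbitrary seed value
first_draw <- toy[sample(nrow(toy), 20), ]

set.seed(123)
second_draw <- toy[sample(nrow(toy), 20), ]

# Both draws contain exactly the same rows
identical(first_draw, second_draw)
```

In the notebook, the equivalent would be a single `set.seed()` call placed before the `sample_n(9000)` line.
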
# Select a continuous (or ordered integer) column of data that seems most "valuable" given the context of your data, and call this your response variable.

```{r}
# Summary statistics for popularity
summary(data$popularity)

# Check unique values to confirm it's numeric/ordered
unique(data$popularity)

```


The minimum popularity score is 0, and the maximum is 98, indicating a wide range.
The mean (36.5) sits slightly below the median (38.0); on their own these two statistics do not indicate right skew, and the gap is consistent with a cluster of very low scores pulling the mean down.
The first quartile (Q1 = 20.0) and third quartile (Q3 = 56.0) mark the middle 50% of tracks, which have popularity scores between 20 and 56.

The unique(data$popularity) function confirms that popularity is an ordered numeric variable.
The presence of values across the full range (0–98) suggests that popularity is not categorical but a continuous (or ordinal integer) variable.
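
Comparing the mean with the median gives only a rough hint about skew. A moment-based skewness coefficient is one way to check the direction numerically. The sketch below is a minimal version under stated assumptions: `skewness()` is a hypothetical helper defined here (not a base R function), and `scores` is made-up data, not the track sample.

```r
# Moment-based sample skewness: positive means a longer right tail,
# negative means a longer left tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up scores: a small cluster at zero plus a spread of mid-range values
scores <- c(0, 0, 0, 0, 30, 35, 38, 40, 40, 42, 45, 48, 50, 55, 60)

mean(scores) < median(scores)  # TRUE: the zeros drag the mean below the median
skewness(scores) < 0           # TRUE: the tail points toward the low values
```

Applied to the notebook's data, the call would be `skewness(data$popularity)`.
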

# Visual - 1
```{r}
# Histogram of Popularity
ggplot(data, aes(x = popularity)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Track Popularity", x = "Popularity", y = "Count")
```
From the graph, 
1. Many tracks have **0 popularity**, indicating low or no engagement, possibly due to niche artists or newly added songs.  
2. The distribution is **right-skewed**, with most songs having moderate popularity (10-50) and only a few reaching high popularity.  
3. A **bimodal trend** is observed, with peaks at **0 and 25-30 popularity**, suggesting two distinct groups of songs.  
4. Popularity **declines gradually** beyond 50, showing that **mainstream success is rare and highly competitive**.



# Checking how popularity correlates with other numerical features.
```{r}
# Select numeric columns only
numeric_data <- data %>% select_if(is.numeric)

# Compute correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")

# Show correlation of popularity with other features
cor_matrix["popularity", ]
```

Popularity has **no strong correlation** with any numerical feature. **Energy (-0.139), liveness (-0.121), and speechiness (-0.117) show slight negative associations**, while **danceability (0.044) and time_signature (0.042) show very weak positive ones**. 
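
Reading a long correlation row by eye makes it easy to misquote a value, so sorting the row by absolute size can help. The sketch below uses the built-in `mtcars` dataset purely as a stand-in, with `mpg` playing the role of the response; neither is part of the original analysis.

```r
# mtcars is a built-in dataset used here only as a stand-in;
# "mpg" plays the role of the response variable
cor_row <- cor(mtcars, use = "complete.obs")["mpg", ]

# Drop the response's correlation with itself, then order by absolute size
cor_row <- cor_row[names(cor_row) != "mpg"]
ranked <- cor_row[order(abs(cor_row), decreasing = TRUE)]

round(ranked, 3)
```

In the notebook, the same two steps could be applied to `cor_matrix["popularity", ]`.
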


# Visual - 2
```{r}
ggplot(data, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.5, color = "red") +
  theme_minimal() +
  labs(title = "Danceability vs Popularity", x = "Danceability", y = "Popularity")
```
The scatter plot shows a **weak positive correlation** between danceability and popularity, meaning tracks with higher danceability **tend to be more popular** but not strongly. The points are widely dispersed, suggesting **popularity is influenced by multiple factors** beyond danceability. 

# Select a categorical column of data (explanatory variable) that you expect might influence the response variable.

We need a categorical variable that might influence popularity. A good choice is "key" (musical key of the track), since different musical keys might affect listener preferences.
```{r}

# Check unique values in the key column
unique(data$key)

# Count occurrences of each key to ensure we have enough data
table(data$key)

# Group tracks into Major and Minor using the mode column (1 = Major)
data$key_grouped <- ifelse(data$mode == 1, "Major", "Minor")

# Check the new grouping
table(data$key_grouped)

```

**Musical key** is a reasonable categorical candidate, but with **12 distinct keys** the comparison is simplified to two groups, **Major and Minor**. Note that this grouping comes from the `mode` column (`mode == 1` is labeled Major, anything else Minor), so `key_grouped` reflects a track's modality rather than a regrouping of the 12 key values. The dataset has **5,210 Major tracks** and **3,790 Minor tracks**, ensuring enough data for comparison. Analyzing popularity across these groups can reveal whether **listeners prefer one mode over the other**.
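
Since the Major/Minor label is built from `mode`, an explicit factor keeps the label mapping visible, and a cross-tabulation shows how the grouping relates to `key`. This is a small sketch on made-up values; the `toy` data frame is an assumption for illustration.

```r
# Made-up example values; in the notebook these would come from data$key and data$mode
toy <- data.frame(
  key  = c(0, 5, 7, 0, 11),
  mode = c(1, 0, 1, 0, 1)
)

# Explicit factor levels keep the label mapping visible (1 = Major, 0 = Minor)
toy$key_grouped <- factor(toy$mode, levels = c(0, 1), labels = c("Minor", "Major"))

# Cross-tabulating shows that the grouping follows mode, not the 12 key values:
# key 0 appears in both the Minor and the Major column
table(toy$key, toy$key_grouped)
```
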

# Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results (e.g., use box plots). Be clear about how the R output relates to your conclusions.

For the ANOVA test, we compare mean popularity between the Major and Minor groups.

```{r}
# Run ANOVA test
anova_result <- aov(popularity ~ key_grouped, data = data)

# Print ANOVA summary
summary(anova_result)
```
- **Null Hypothesis (H₀)**: Mean popularity is the **same** for Major and Minor tracks.  
- **Alternative Hypothesis (H₁)**: Mean popularity **differs** between Major and Minor tracks.  

The **ANOVA test output** shows an F-value of **3.674** with a p-value of **0.0553**. Since **p > 0.05**, we fail to reject H₀, meaning **there is no strong evidence that key type (Major vs. Minor) influences popularity significantly**. However, since the p-value is close to 0.05, there might be a weak trend worth further exploration.


### Interpretation of ANOVA Results:  
The ANOVA test evaluates whether mean **popularity** differs between **Major and Minor keys**. The **p-value of 0.0553** is above the conventional **5% significance level** but below **10%**, indicating weak evidence of an effect.  

### Conclusion:  
Since the **p-value is greater than 0.05**, there is **no strong statistical evidence** that key type (Major or Minor) influences a track's popularity. However, at a **10% significance level**, the results suggest a slight trend, though not conclusive. Key type alone does not appear to be a major factor in determining a song’s popularity. Other elements, such as **genre, lyrics, tempo, or artist recognition**, likely have a stronger influence on listener preferences.
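
With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test (the F statistic equals the squared t statistic), and the share of variance explained can be read from the sums of squares. The sketch below checks the equivalence on simulated data and then computes eta squared from the sums of squares in the rendered ANOVA output (2167 for `key_grouped`, 5309050 for residuals). The simulated data frame `sim` and the seed are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, only for a repeatable illustration

# Simulated stand-in for the notebook's data (values are made up)
sim <- data.frame(
  key_grouped = rep(c("Major", "Minor"), times = c(60, 40)),
  popularity  = c(rnorm(60, mean = 37, sd = 24), rnorm(40, mean = 36, sd = 24))
)

# One-way ANOVA with two groups
tab <- summary(aov(popularity ~ key_grouped, data = sim))[[1]]
f_value <- tab[["F value"]][1]

# Pooled-variance t-test on the same data
tt <- t.test(popularity ~ key_grouped, data = sim, var.equal = TRUE)
t_value <- unname(tt$statistic)

all.equal(f_value, t_value^2)  # TRUE: F equals t squared

# Eta squared from the sums of squares reported in the notebook's ANOVA table
eta_sq_reported <- 2167 / (2167 + 5309050)
eta_sq_reported  # about 0.0004, i.e. roughly 0.04% of the variance
```

An effect size this small supports the reading that, whatever the p-value, mode explains very little of the variation in popularity.
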

```{r}
# Box plot of Popularity vs Key Type
ggplot(data, aes(x = key_grouped, y = popularity, fill = key_grouped)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Popularity by Key Type", x = "Key Type", y = "Popularity") +
  scale_fill_manual(values = c("Major" = "blue", "Minor" = "red"))
```
The **box plot** visualizes the distribution of **popularity scores** for songs in **Major (blue) and Minor (red) keys**. The median popularity is nearly the same for both groups, with a similar interquartile range and spread, indicating that **key type (Major or Minor) does not strongly influence popularity**.

# Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.

```{r}
# Check structure of danceability
str(data$danceability)

# Summary statistics
summary(data$danceability)

# Check correlation with popularity
cor(data$popularity, data$danceability, use="complete.obs")
```

### **Interpretation of Analysis on Danceability and Popularity**

- **Danceability** is a continuous variable ranging from **0.0647 to 0.9710**, with a median of **0.658**.
- The **correlation between danceability and popularity** is **0.0437**, indicating a **very weak positive relationship**.
- Since the correlation is close to **zero**, danceability **does not strongly influence popularity**, meaning higher danceability does not necessarily lead to more popular songs.


### Step 2: Check for Linearity

Before fitting a linear regression model, we visualize the relationship.
```{r}

# Scatter plot with trend line
ggplot(data, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", col = "blue") +
  theme_minimal() +
  labs(title = "Danceability vs. Popularity",
       x = "Danceability",
       y = "Popularity")

```

### **Interpretation of the Scatter Plot: Danceability vs. Popularity**
- Each **dot** represents a song, plotting its **danceability** (x-axis) against **popularity** (y-axis).  
- The **blue line** represents the **linear trend** fitted using a regression model.  
- The **line is nearly flat**, indicating a **very weak positive correlation**, which aligns with the calculated correlation of **0.0437**.  
- The **spread of points** shows no clear linear pattern, suggesting **danceability is not a strong predictor of popularity**.


```{r}
# Fit the model
lm_model <- lm(popularity ~ danceability, data = data)

# Display model summary
summary(lm_model)
```

The linear model shows a small but statistically significant association between danceability and popularity (\( p < 0.001 \)). However, the **R² value (0.00191)** is extremely low, meaning danceability explains **only 0.19% of the variation** in popularity. The coefficient (6.16) is the expected change in popularity per 1-unit increase in danceability; because danceability only ranges from about 0.06 to 0.97 here, a more practical reading is roughly **0.62 points per 0.1 increase**. The **high residual standard error (24.27)** indicates large unexplained variability, so danceability alone is not a strong predictor and other factors should be considered for better modeling.
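
The numbers in the model summary can be cross-checked with a little arithmetic: in a simple linear regression, R² equals the squared correlation between predictor and response, and the slope can be rescaled to a more realistic step size. The sketch below uses the values reported in the rendered output and then confirms the identity on the built-in `mtcars` dataset, which is a stand-in and not part of the original analysis.

```r
# Values copied from the model summary and correlation reported in the notebook
slope <- 6.1601       # change in popularity per 1.0 of danceability
r     <- 0.04370031   # correlation between popularity and danceability

slope * 0.1   # expected change in popularity for a 0.1 increase in danceability
r^2           # agrees with the reported Multiple R-squared (about 0.00191)

# Same identity on a built-in stand-in dataset
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(summary(fit)$r.squared, cor(mtcars$mpg, mtcars$wt)^2)
```
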

# Residual Analysis
```{r}
# Plot residuals to check model assumptions
par(mfrow = c(2,2))
plot(lm_model)
```

The residual analysis plots assess the assumptions of linear regression:  

1. **Residuals vs Fitted**: Residuals are centered around zero overall, but some clustering indicates potential non-linearity or heteroscedasticity.  
2. **Q-Q Plot**: Residuals **deviate from the normal line**, suggesting **non-normality** in errors.  
3. **Scale-Location**: The spread looks broadly even, but a slight upward trend suggests possible heteroscedasticity.  
4. **Residuals vs Leverage**: No extreme leverage points, but a few observations may **influence the model**.  

Overall, the plots indicate **violations of linear regression assumptions**, so a single-predictor linear model is likely inadequate.
A multiple regression model with more features (e.g., tempo, energy, loudness, speechiness) would be a natural next step for understanding popularity.
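
When residuals look non-normal, one option is to check the slope's uncertainty with a bootstrap confidence interval instead of relying only on the t-based one. The notebook loads `boot` but does not use it, so the following is only a sketch of how that could look. It uses `mtcars` as a stand-in dataset; the helper name `slope_fn`, the seed, and the choice of 500 resamples are all assumptions for illustration.

```r
library(boot)  # recommended package bundled with standard R distributions

set.seed(1)  # arbitrary seed so the resamples are repeatable

# Statistic for boot(): refit the model on a resampled set of rows
# and return the slope. mtcars is a stand-in dataset.
slope_fn <- function(d, idx) {
  coef(lm(mpg ~ wt, data = d[idx, ]))[["wt"]]
}

boot_out <- boot(data = mtcars, statistic = slope_fn, R = 500)

# Percentile bootstrap interval for the slope
ci <- boot.ci(boot_out, type = "perc")
ci$percent[4:5]  # lower and upper limits
```

For the notebook's model, the statistic would instead refit `lm(popularity ~ danceability, data = d[idx, ])` and return the `danceability` coefficient.
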







