INTRODUCTION

In this R project, we will embark on a two-phase data analysis journey. The initial phase centers on the application of simple linear regression models, followed by the second phase, which involves the utilization of k-nearest neighbors modeling. The objective is to investigate datasets pertaining to tech industry salaries and Spotify music information, with the ultimate aim of deriving meaningful insights and generating predictive outcomes. To initiate the process, we will load the essential libraries and ready the data for analysis.

Load necessary packages:

library(ggplot2)
library(forecast)
library(dplyr)
library(caret)
library(FNN)

PART I - SIMPLE LINEAR REGRESSION

In the first part of our analysis, we delve into the world of Simple Linear Regression (SLR). Our journey begins by preparing the data, exploring its nuances, and understanding the relationships between variables. We aim to uncover insights into the tech industry’s compensation dynamics and build a predictive model that can estimate total yearly compensation based on years of experience. Through careful examination and statistical analysis, we shed light on the correlation between these factors and unveil the model’s performance in predicting compensation. Join us as we navigate the intricacies of SLR to gain valuable insights that benefit HR professionals, hiring managers, and tech industry employees.

1. Data Preparation

The initial step involved loading the necessary packages and importing the dataset, “tech_salary_dataset.csv,” into an R data frame named “salary.df.” This ensured that our data was ready for further analysis.

# Load dataset
salary.df <- read.csv("tech_salary_dataset.csv")

head(salary.df)  # Return first few rows 

Data Exploration

Our data exploration phase included an examination of the dataset’s structure, identifying both numeric and categorical variables. This section provided an overview of the dataset’s key characteristics.

# View structure of dataset
str(salary.df)
'data.frame':   62642 obs. of  24 variables:
 $ timestamp              : chr  "6/7/2017 11:33:27" "6/10/2017 17:11:29" "6/11/2017 14:53:57" "6/17/2017 0:23:14" ...
 $ company                : chr  "Oracle" "eBay" "Amazon" "Apple" ...
 $ level                  : chr  "L3" "SE 2" "L7" "M1" ...
 $ title                  : chr  "Product Manager" "Software Engineer" "Product Manager" "Software Engineering Manager" ...
 $ totalyearlycompensation: int  127000 100000 310000 372000 157000 208000 300000 156000 120000 201000 ...
 $ location               : chr  "Redwood City, CA" "San Francisco, CA" "Seattle, WA" "Sunnyvale, CA" ...
 $ yearsofexperience      : num  1.5 5 8 7 5 8.5 15 4 3 12 ...
 $ yearsatcompany         : num  1.5 3 0 5 3 8.5 11 4 1 6 ...
 $ tag                    : chr  NA NA NA NA ...
 $ basesalary             : int  107000 0 155000 157000 0 0 180000 135000 0 157000 ...
 $ stockgrantvalue        : num  20000 0 0 180000 0 0 65000 8000 0 26000 ...
 $ bonus                  : num  10000 0 0 35000 0 0 55000 13000 0 28000 ...
 $ gender                 : chr  NA NA NA NA ...
 $ otherdetails           : chr  NA NA NA NA ...
 $ cityid                 : int  7392 7419 11527 7472 7322 11527 11521 11527 11521 11527 ...
 $ dmaid                  : int  807 807 819 807 807 819 819 819 819 819 ...
 $ rowNumber              : int  1 2 3 7 9 11 12 13 15 16 ...
 $ Masters_Degree         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Bachelors_Degree       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Doctorate_Degree       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Highschool             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Some_College           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Race                   : chr  NA NA NA NA ...
 $ Education              : chr  NA NA NA NA ...

The numerical variables in the dataset include: total yearly compensation, total years of experience, years at the company, base salary, stock grant value and bonus.

On the other hand, the categorical variables in the dataset include: timestamp, company, level, title, location, tag, gender, other details, cityid, dmaid, row number, masters degree, bachelors degree, doctorate degree, highschool, some college, race, and education.

Data Partitioning

Before conducting deeper analysis, the data was partitioned into training and validation sets. This crucial step ensures the model’s relevance and accuracy.

# Set seed for replicability of sample/output
set.seed(80)  

# Sample 60% of dataset & assign it to index for training set
train.index <- sample(row.names(salary.df), 0.6*dim(salary.df)[1])

# Assign remaining rows to index for validation set
valid.index <- setdiff(row.names(salary.df), train.index)  

# Subset all of the rows for all the columns associated with the training data/validation data indexes
train.df <- salary.df[train.index, ]
valid.df <- salary.df[valid.index, ]

It is important to partition the data before doing any sort of in-depth analysis of the variables in order to ensure the model is relevant to the data (i.e., the training data, which is used to fit the model); and performs accurately in classifying/predicting new records (i.e., using validation data to ensure the model was not “overfit” to the training data).

EDA - Data Visualization + Summary Stats

Exploratory data analysis involved data visualization and summary statistics. We created scatterplots, calculated correlations, and interpreted the relationships between variables.

# Create a scatterplot of yearsofexperience vs. totalyearlycompensation
ggplot(data= train.df, aes(x= yearsofexperience, y= totalyearlycompensation)) + geom_point() + labs(title = "Yearly Compensation vs. Years of Experience", x= "Years of Experience", y= "Total Yearly Compensation") + geom_smooth(method = "lm")


# the geom_smooth() function allows us to add a best-fit line

This plot suggests that the two variables have a moderate positive correlation. This is evident based on the close proximity of the points to the best-fit line (i.e., given a few potential outliers) and suggests that total yearly compensation will increase as a tech employee’s number of years of experience in the industry increases. Ultimately, this relationship makes intuitive sense to me because typically the longer you stay in a certain industry (e.g., in a certain position or at a certain company), the more knowledgeable/qualified you will be. As a consequence, more likely you are to receive a higher compensation in the following year/period – due to experience. Likewise, it is not usually common for an employee’s salary to decrease over time, especially when they are holding the same position/gaining more experience over time.

# Calculate the correlation between yearsofexperience & totalyearlycompensation
cor(x= train.df$yearsofexperience, y= train.df$totalyearlycompensation)
[1] 0.4263163
# Compute significance of correlation
cor.test(x= train.df$yearsofexperience, y= train.df$totalyearlycompensation)

    Pearson's product-moment correlation

data:  train.df$yearsofexperience and train.df$totalyearlycompensation
t = 91.366, df = 37583, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4180081 0.4345532
sample estimates:
      cor 
0.4263163 

The correlation observed between “yearsofexperience” and “totalyearlycompensation” in the training dataset is approximately 0.4263. This value suggests a moderate positive correlation, indicating that as years of experience increase, total yearly compensation tends to rise as well.

When subjected to statistical analysis using Pearson’s correlation test, the results confirm the significance of this relationship. The remarkably low p-value (less than 2.2e-16) unequivocally demonstrates that the observed correlation is highly meaningful. It provides compelling evidence that the connection between years of experience and total yearly compensation is not merely a random occurrence.

Furthermore, the 95 percent confidence interval for the correlation, ranging from 0.4180 to 0.4346, narrows down the likely range within which the true correlation resides. This not only reinforces the concept of a substantial and consistent link between these two variables but also enhances our confidence in the findings.

While the correlation may not be exceptionally strong, it is statistically significant and highlights a meaningful connection within the dataset. In essence, as individuals accumulate more years of experience, their total yearly compensation tends to exhibit a discernible upward trend, as substantiated by the data and statistical analysis.

Linear Regression Model

In this next step a simple linear regression model is constructed to predict total yearly compensation based on years of experience. The model’s summary and a hypothetical input prediction were included to illustrate its functionality.

# Create SLR model
model <- lm(totalyearlycompensation ~ yearsofexperience, data = train.df)

# Generate model summary
summary(model)

Call:
lm(formula = totalyearlycompensation ~ yearsofexperience, data = train.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-744515  -71070  -14103   45638 4663543 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       143168.5     1036.0  138.20   <2e-16 ***
yearsofexperience  10193.4      111.6   91.37   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 126600 on 37583 degrees of freedom
Multiple R-squared:  0.1817,    Adjusted R-squared:  0.1817 
F-statistic:  8348 on 1 and 37583 DF,  p-value: < 2.2e-16

Model Evaluation

In the following section, we evaluated the model’s performance using residuals analysis, highlighting instances where the model’s predictions deviated significantly from actual values.

# Generate summary of model residuals
summary(model$residuals)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-744515  -71070  -14103       0   45638 4663543 

In this model, the minimum residual value is -744,515 & the maximum residual value is 4,663,543.

# Create df with fitted values (predictions) & residuals
residual.df <- data.frame("Residuals" = model$residuals, "Predicted Total Comp" = model$fitted.values)

# Bind the 'residual.df' to the df with our training data
train.df <- cbind(train.df, residual.df)

# Returns first few rows of data to ensure successful binding of the 2 dataframes
head(train.df)  
# Identify the observation whose rating generated the highest residual value
subset(train.df, Residuals == max(train.df$Residuals))

Based on the observation whose rating generated the highest residual value, the person’s actual total compensation was 4,980,000 (dollars); while the model predicted it would be 316,456.80 (dollars). The residual is calculated by taking the difference between the actual total compensation for each employee and what the regression model predicted it to be (i.e., 4980000 - 316456.80 = 4663543.30).

# Identify the observation whose rating generated the lowest residual value
subset(train.df, Residuals == min(train.df$Residuals))

Based on the observation whose rating generated the lowest residual value, the person’s actual total compensation was 102,000 (dollars); while the model predicted it would be 846,515.10 (dollars). The residual is calculated by taking the difference between the actual total compensation for each employee and what the regression model predicted it to be (i.e., 102000 - 846515.10 = -744,515.10).

It looks like there are some cases where this model is quite a bit “off the mark”. Although intuitively it makes sense that years of experience could be a factor in the total compensation an employee in the high tech industry receives, there a few possible reasons why years of experience might not closely predict total compensation. Specifically, it is possible that “years of experience” is correlated with one or more of the other input variables in the model, which exposes the model to the dangers of multi-collinearity. There is also the case where variable reduction is the focus; in which there are just “better” input variables that act as predictors for total compensation – with years of experience being one of the potential variables under the scope for elimination/being “dropped.” For example, based on the description of each of the variables in the original dataset, we know that total yearly compensation is the sum of several other numerical values - including base salary, bonus, & stock grant value.

Summary of Linear Regression Equation

summary(model)  # Return model summary to obtain coefficients (m, slope & b, y-intercept)

Call:
lm(formula = totalyearlycompensation ~ yearsofexperience, data = train.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-744515  -71070  -14103   45638 4663543 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       143168.5     1036.0  138.20   <2e-16 ***
yearsofexperience  10193.4      111.6   91.37   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 126600 on 37583 degrees of freedom
Multiple R-squared:  0.1817,    Adjusted R-squared:  0.1817 
F-statistic:  8348 on 1 and 37583 DF,  p-value: < 2.2e-16

Therefore, the regression equation generated by the model is: Total yearly compensation (y) = 10193.4(x) + 143168.5

Make up a hypothetical input value and use it to make a new prediction.

new_input <- 10

total_yearly_compensation <- function(x){
  return(10193.4*x + 143168.5)
}
total_yearly_compensation(10)
[1] 245102.5

In sum, ‘10’ was used as the hypothetical input value for “yearsofexperience” in the regression equation to predict as an outcome, the total yearly compensation of an employee with 10 years of experience in the high tech industry. As such, for an employee who has 10 years of experience in the high tech industry, the predicted total yearly compensation is 245,102.50 (dollars).

Training set:

accuracy(train.df$Predicted.Total.Comp, train.df$totalyearlycompensation)
                   ME     RMSE      MAE       MPE     MAPE
Test set 1.974391e-12 126604.5 82928.08 -39.05618 60.69246

Validation Set:

salary.pred <- predict(model, valid.df)
accuracy(salary.pred, valid.df$totalyearlycompensation)
               ME     RMSE      MAE       MPE     MAPE
Test set -757.419 122780.7 82832.89 -40.28764 62.02053

The purpose of making this comparison is to ensure the model performs well in predicting new records (i.e., evaluating its predictive accuracy). This means that the model is not overfit to the predictor values in the training set (i.e., when applied to new, unknown records).

In particular, the root-mean-squared error (RMSE) tells us about the “error magnitude” in units that correspond to the output (e.g., total yearly compensation); while the mean absolute error (MAE) tells us about the magnitude of the average absolute error. These prediction accuracy measures corresponding to the training data tell us about model fit when comparing the predicted/actual values of total yearly compensation; while the prediction accuracy measures that correspond to the validation set communicate the model’s ability to predict new records that are unfamiliar to the model/training data. In general, RMSE/MAE should be lower for the training data than the validation data, as the training data was used to fit the regression model (i.e., with low error magnitude indicating better fit).

However, even with low RMSE and MAE on training data, bias may persist - systematic errors that consistently push predictions in one direction. To detect bias, the RMSE and MAE are evaluated on the validation set. If they significantly differ from training data, it may indicate bias. In essence, this comparison ensures the model’s generalization and detects potential bias that might not be apparent solely from training data.

Model Comparison

Finally, we compared the model’s root-mean-squared error (RMSE) to the standard deviation of total compensation in the training set, providing insights into its performance.

sd(train.df$totalyearlycompensation)
[1] 139962.2

The model’s RMSE is less than the standard deviation of total compensation in the training set. Comparing our model’s RMSE to the standard deviation of total compensation reveals key insights into its performance.

In essence, when the RMSE exceeds or equals the standard deviation, it signals that our model’s predictions are no better than a “null” model that merely guesses the mean. This suggests the model isn’t effectively learning from the input features and lacks meaningful predictive power.

Conversely, a lower RMSE compared to the standard deviation indicates that the model is outperforming this basic baseline. It demonstrates the model’s ability to capture valuable patterns and make more accurate predictions.

In summary, this comparison not only gauges predictive accuracy but also highlights whether the model adds value beyond a simplistic approach.

Conclusions (SLR)

In this section, we distill the key insights gained from our analysis and discuss their relevance to various stakeholders. These insights shed light on the relationship between years of experience and total yearly compensation in the tech industry and offer practical implications for decision-makers.

Key Insights
  • Moderate Positive Correlation: Our analysis revealed a moderate positive correlation between years of experience and total yearly compensation, indicating that, on average, employees earn higher compensation as they gain more experience.

  • Statistical Significance: Statistical tests confirmed the significance of this correlation, providing robust evidence that the relationship is not random.

  • Linear Regression Model: The linear regression model provided a quantitative representation of the relationship, facilitating salary predictions based on years of experience.

  • Model Performance: Our model demonstrated good performance in predicting compensation, as indicated by low RMSE and MAE values on both training and validation datasets.

  • Bias Detection: While generally performing well, the model exhibited bias in some cases, highlighting areas for potential improvement.

  • Model Value: Comparing the RMSE to the standard deviation demonstrated that the model adds value by making more accurate predictions than a mean-based model.

Stakeholder Implications
  • HR Departments: HR professionals can utilize the linear regression model to estimate compensation for potential hires, aiding in competitive job offers.

  • Hiring Managers: Understanding the relationship between experience and compensation equips hiring managers to make informed decisions during the hiring process.

  • Employees: Current employees can gain insights into their potential career trajectories and financial goals based on how their compensation may evolve.

  • Data Analysts: The provided code and process serve as a template for data analysts and data scientists to conduct similar analyses in their organizations, exploring relationships between variables and building predictive models.

In conclusion, this analysis provides valuable insights into tech industry compensation dynamics and offers a practical model for predicting salaries based on experience, benefiting a range of stakeholders in decision-making roles.

PART II - K-NEAREST NEIGHBORS

In the second part of our analysis, we venture into the realm of K-Nearest Neighbors (k-NN) modeling. Our exploration begins by selecting a song from the Spotify Top 200 Weekly Charts of 2022 and dissecting its attributes. We leverage k-NN to predict whether this song will be liked by a user, George, and uncover its nearest musical companions. Along the way, we optimize the k-value for our model, reduce dimensionality for efficiency, and normalize the data for equitable comparisons. This section not only showcases the power of k-NN in music recommendation but also serves as a blueprint for data analysts and stakeholders aiming to enhance user experiences and decision-making in the dynamic world of music data analysis. Join us as we harmonize data science and music to predict likability and uncover the secrets of k-NN.

Load Data

The code begins by loading the ‘spotify_top_charts_22.csv’ dataset, which contains information about songs that appeared on the Top 200 Weekly (Global) charts of Spotify in 2022. Each row represents a specific song.

top.charts.df <- read.csv("spotify_top_charts_22.csv")

top.charts.df 

Song Selection

First, we will select a song for analysis.

I chose the song “Cold Heart” by Elton John, Dua Lipa, & PNAU (row 6).

top.charts.df[6, ]  # returns the row for the song "Cold Heart"

Song Information

Various attributes of the selected song are queried, such as danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, tempo, and duration_ms.

danceability:

top.charts.df[6, "danceability"]
[1] 0.795

energy:

top.charts.df[6, "energy"]
[1] 0.8

loudness:

top.charts.df[6, "loudness"]
[1] -6.32

speechiness:

top.charts.df[6, "speechiness"]
[1] 0.0309

acousticness:

top.charts.df[6, "acousticness"]
[1] 0.0354

instrumentalness:

options(scipen = 999)
top.charts.df[6, "instrumentalness"]
[1] 0.0000725

liveness:

top.charts.df[6, "liveness"]
[1] 0.0915

tempo:

top.charts.df[6, "tempo"]
[1] 116.032

duration_ms:

top.charts.df[6, "duration_ms"]
[1] 202735

Data Exploration

In this step, we explore the datasets & examine the target variable.

# Subset selected song
song.df <- top.charts.df[6, ]

# Examine structure of dataset
str(song.df)  
'data.frame':   1 obs. of  17 variables:
 $ uri             : chr "spotify:track:7rglLriMNBPAyuJOMGwi39"
 $ artist_names    : chr "Elton John, Dua Lipa, PNAU"
 $ track_name      : chr "Cold Heart - PNAU Remix"
 $ peak_rank       : int 4
 $ weeks_on_chart  : int 32
 $ danceability    : num 0.795
 $ energy          : num 0.8
 $ key             : int 1
 $ loudness        : num -6.32
 $ mode            : int 1
 $ speechiness     : num 0.0309
 $ acousticness    : num 0.0354
 $ instrumentalness: num 0.0000725
 $ liveness        : num 0.0915
 $ tempo           : num 116
 $ time_signature  : int 4
 $ duration_ms     : int 202735

Read Additional Dataset

A second dataset, ‘spotify_data.csv,’ is read into the environment, and its structure is examined. It contains ratings of various songs, with the ‘target’ variable indicating whether George liked the song (1 for liked, 0 for disliked).

# Read dataset containing George's ratings of various songs into environment
spotify.df <- read.csv("spotify_data.csv")

# Examine structure of dataset
str(spotify.df)
'data.frame':   2017 obs. of  17 variables:
 $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
 $ acousticness    : num  0.0102 0.199 0.0344 0.604 0.18 0.00479 0.0145 0.0202 0.0481 0.00208 ...
 $ danceability    : num  0.833 0.743 0.838 0.494 0.678 0.804 0.739 0.266 0.603 0.836 ...
 $ duration_ms     : int  204600 326933 185707 199413 392893 251333 241400 349667 202853 226840 ...
 $ energy          : num  0.434 0.359 0.412 0.338 0.561 0.56 0.472 0.348 0.944 0.603 ...
 $ instrumentalness: num  0.0219 0.00611 0.000234 0.51 0.512 0 0.00000727 0.664 0 0 ...
 $ key             : int  2 1 2 5 5 8 1 10 11 7 ...
 $ liveness        : num  0.165 0.137 0.159 0.0922 0.439 0.164 0.207 0.16 0.342 0.571 ...
 $ loudness        : num  -8.79 -10.4 -7.15 -15.24 -11.65 ...
 $ mode            : int  1 1 1 1 0 1 1 0 0 1 ...
 $ speechiness     : num  0.431 0.0794 0.289 0.0261 0.0694 0.185 0.156 0.0371 0.347 0.237 ...
 $ tempo           : num  150.1 160.1 75 86.5 174 ...
 $ time_signature  : num  4 4 4 4 4 4 4 4 4 4 ...
 $ valence         : num  0.286 0.588 0.173 0.23 0.904 0.264 0.308 0.393 0.398 0.386 ...
 $ target          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ song_title      : chr  "Mask Off" "Redbone" "Xanny Family" "Master Of None" ...
 $ artist          : chr  "Future" "Childish Gambino" "Future" "Beach House" ...

The variable ‘target’ is a categorical variable as it tells us whether or not George liked the song (i.e., 1 = yes he liked the song, 0 = he did not like the song). Since the categories are communicated using binary dummy variables, the data structure is integer or numeric form.

If target is not currently a factor, convert it into a factor. It will be our response variable in this model. Target tells us whether George, the person who uploaded this dataset, liked the song. “1” means that George liked it, and “0” means that he did not.

spotify.df$target <- as.factor(spotify.df$target)
# Return table with unique values & the total count of observations corresponding to each of those unique values
table(spotify.df$target)  

   0    1 
 997 1020 

The unique values are “0” (i.e., George did not like the song) & “1” (i.e., George did like the song). There are 997 songs he disliked and 1020 that he liked.

Handling Missing Values

Next, we check for missing values in the dataset and find that there are none.

cat("There are", sum(is.na(spotify.df)), "missing values in the dataset.")
There are 0 missing values in the dataset.

Dimension Reduction

Unnecessary columns (e.g., ‘key’, ‘mode’, ‘time_signature’) are removed from the dataset, aiming to simplify the k-nearest neighbors (k-NN) model and improve its efficiency.

# Remove unnecessary columns (e.g., 'key', 'mode', 'time_signature')
spotify.df <- spotify.df[ , -c(7, 10, 13)]

head(spotify.df)  # Print first few rows to ensure columns were removed

Data Partitioning

The dataset is split into training and validation sets to train and evaluate the k-NN model.

# Set seed for replicability
set.seed(80)

# Partition dataset into training & validation sets
train.index <- sample(c(1:nrow(spotify.df)), nrow(spotify.df)*0.6)
train.df <- spotify.df[train.index, ]
valid.df <- spotify.df[-train.index, ]

Feature Selection

The code calculates the mean values of various attributes for songs George liked and disliked. It identifies variables showing a percentage difference of 10% or more between the two groups and removes those variables from the training data.

train.df.filtered <- train.df %>% group_by(target) %>% summarize(mean(danceability), mean(energy), mean(loudness), mean(speechiness), mean(acousticness), mean(instrumentalness), mean(liveness), mean(tempo), mean(duration_ms))

Identify which variables show a percentage difference of 10% or more? If any variables show less than 10% difference in mean value between the two groups (songs that George likes, and songs that George doesn’t like), then remove those variables entirely.

 # Use 'for' loop to create percent_change object
for (i in 2:ncol(train.df.filtered)) {
  percent_change <- ((train.df.filtered[1, c(2:10)] - train.df.filtered[2, c(2:10)])/train.df.filtered[1, c(2:10)])*100
} 

percent_change  # print percentage difference output
# Return columns (variables) of the 'percent_change' object, where the percentage difference is 10% or more
which(abs(percent_change[ , ]) >= 10) 
[1] 4 5 6

The variables “speechiness” (column 4), “acousticness” (column 5), & “instrumentalness” (column 6) show a percentage difference of 10% or more.

# List variable names in training data so we can remove these 3 variables
names(train.df) 
 [1] "X"                "acousticness"     "danceability"    
 [4] "duration_ms"      "energy"           "instrumentalness"
 [7] "liveness"         "loudness"         "speechiness"     
[10] "tempo"            "valence"          "target"          
[13] "song_title"       "artist"          
train.df <- train.df[ , -c(2, 6, 9)]

It is reasonable to consider removing variables with highly similar values in a k-nearest neighbors (k-NN) model because doing so helps create a more parsimonious model. By reducing the number of predictor variables, we aim to improve the model’s ability to generalize and make accurate predictions when faced with new data.

Moreover, when we have an excess of predictor variables, the model requires a larger number of training records to establish meaningful patterns. Consequently, the time it takes to identify the k-nearest neighbors for new records also increases.

Therefore, striving for a simpler, more concise model can significantly boost its confidence and efficiency when confronted with real-world data.

Data Normalization

The data is normalized using the preProcess function. This step is crucial for k-NN algorithms, as it ensures that all variables contribute equally to distance calculations.

norm.values <- preProcess(train.df[ , 2:7], method = c("center", "scale"))  

# only used values in columns number 2-8 (i.e., for numeric variables only)


# Initializes normalized df's to original df
train.norm.df <- train.df
valid.norm.df <- valid.df
spotify.norm.df <- spotify.df

train.norm.df[ , 2:7] <- predict(norm.values, train.df[ , 2:7])
valid.norm.df[ , c(3:5, 7:8, 10)] <- predict(norm.values, valid.df[ , c(3:5, 7:8, 10)])
spotify.norm.df[ , c(3:5, 7:8, 10)] <- predict(norm.values, spotify.df[ , c(3:5, 7:8, 10)])


song.norm.df <- predict(norm.values, song.df[, c(6:7, 9, 14:15, 17)])

Model Building

Using a k-value of 7, the code generates a predicted classification for the song “Cold Heart.” It also identifies the song’s 7 nearest neighbors.

nn <- knn(train = train.norm.df[ , 2:7], test = song.norm.df, cl = train.norm.df[ , 9], k = 7)

nn
[1] 1
attr(,"nn.index")
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  727  170  380  729   41   50 1171
attr(,"nn.dist")
          [,1]      [,2]      [,3]      [,4]     [,5]      [,6]      [,7]
[1,] 0.6417425 0.6606155 0.6767229 0.7279269 0.730869 0.7505163 0.7629197
Levels: 1

The model predicted George would like the song “Cold Heart.” In other words, the model classified the song “Cold Heart” as belonging to the “1” (e.g., songs George will like) class.

The titles, artists, & outcome classes of the k-nearest neighbors are as follows

knn_output <- train.norm.df[c(727, 170,  380,  729,   41,   50, 1171), 9:11]
knn_output

Model Evaluation

The code evaluates the model’s performance for different k-values and determines the optimal k-value, which is found to be 6.

# Initialize df with columns for 'k' & 'accuracy'
accuracy.df <- data.frame(k = seq(1, 20, 1), accuracy = rep(0, 20))

for (i in 1:20) {
  knn.pred <- knn(train.norm.df[ , 2:7], valid.norm.df[ , c(3:5, 7:8, 10)], cl = train.norm.df[ , 9], k = i)
  accuracy.df[i, 2] <- confusionMatrix(knn.pred, valid.norm.df[ , 12])$overall[1]
}

accuracy.df 

The optimal ‘k’ value is 6 because it has the maximum accuracy rate (and minimum error rate) before the accuracy rate starts to decrease again at k=7 & k=8. Though we see that the accuracy rate starts to increase again at k=9, we must consider the bias-variance tradeoff: as k increases, the model starts to approximate the training set and would predict the majority class in the data set for all case scenarios.

Visualization

A scatterplot is created to visualize the relationship between k-values and accuracy.

library(ggplot2)

ggplot(data = accuracy.df, aes(x= k, y= accuracy)) + geom_point() + labs(title = "Accuracy vs. K", x= "K", y= "Accuracy")

Final Model & Prediction

The k-NN model is rerun with the optimal k-value, and predictions are made for the selected song. The model predicts that George will like the song “Cold Heart,” with the nearest neighbors identified.

nn.new <- knn(train = spotify.norm.df[ , c(3:5, 7:8, 10)], test = song.norm.df, cl = spotify.norm.df[ , 12], k = 6)

nn.new
[1] 1
attr(,"nn.index")
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  454  163  520  372  474 1398
attr(,"nn.dist")
          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
[1,] 0.5804036 0.6417425 0.6606155 0.6767229 0.7012645 0.7279269
Levels: 1

The model predicted George would like the song “Cold Heart.” In other words, the model classified the song “Cold Heart” as belonging to the “1” (e.g., songs George will like) class.

knn_new_output <- spotify.norm.df[c(454,  163,  520,  372,  474, 1398), 12:14]
knn_new_output

The final result is the same as when I first ran the k-nn function, but some of the “k-nearest neighbors” changed. The songs “Forever”, “Can I Kick It?”, “Mamacita”, & “ê·\200ê°\200 Return Home” were four of seven k-nearest neighbors in the first example, while “Do It Roger” and “Bizness” are new k-nearest neighbors (e.g., the first iteration of knn included 2 songs that differ from the latter 2, and 1 more additional song).

Conclusions (k-nn)

In the following section, we delve into the key insights gleaned from the analysis and explore how these insights can be leveraged by various stakeholders. From predicting song likability to optimizing model parameters, these findings offer valuable guidance for enhancing decision-making and user experiences in the dynamic realm of music and data analysis. Let’s uncover the implications of this analysis for different stakeholders in the music industry and beyond.

Key Insights
  • The k-NN model predicts that George will like the song “Cold Heart” by Elton John, Dua Lipa, and PNAU.

  • The model identified the nearest neighbors of the song, which includes songs such as “Do It Roger,” “Forever,” and “Mamacita.”

  • The optimal k-value for the model is determined to be 6, based on evaluation results.

  • Dimension reduction was performed by removing variables with low impact on the model, which can lead to a more efficient and interpretable model.

  • Data normalization was applied to ensure that all variables contribute equally to the k-NN distance calculations.

Stakeholder Use
  • Music Recommendation Platforms: Music recommendation platforms like Spotify can use this analysis to enhance their recommendation algorithms. By predicting whether a user will like a song based on their listening history and song attributes, they can improve user engagement.

  • Record Labels: Record labels can benefit from understanding which attributes of a song contribute to its likability. This information can inform their music production and promotion strategies.

  • Artists: Artists can gain insights into what aspects of their music are more likely to resonate with their audience. This knowledge can help them create music that aligns with their fans’ preferences.

  • Data Analysts: Data analysts and data scientists can use this analysis as a template for building predictive models on similar datasets. The code and process serve as a practical example of data preprocessing, feature selection, and model evaluation.

  • Music Researchers: Researchers in the field of music analysis and psychology can use the findings to study the relationship between song attributes and listener preferences, contributing to the understanding of music psychology.

In summary, the analysis provides valuable insights into predicting song likability and can be utilized by various stakeholders in the music industry and beyond to enhance decision-making and user experiences.

---
title: "Data-Driven Predictive Analytics: Salaries and Song Preferences"
author: Tara Cool
output: html_notebook
---

# **INTRODUCTION**

*In this R project, we will embark on a two-phase data analysis journey. The initial phase centers on the application of simple linear regression models, followed by the second phase, which involves the utilization of k-nearest neighbors modeling. The objective is to investigate datasets pertaining to tech industry salaries and Spotify music information, with the ultimate aim of deriving meaningful insights and generating predictive outcomes. To initiate the process, we will load the essential libraries and ready the data for analysis.*



Load necessary packages:
```{r message= FALSE}
library(ggplot2)
library(forecast)
library(dplyr)
library(caret)
library(FNN)
```





# **PART I - SIMPLE LINEAR REGRESSION**

*In the first part of our analysis, we delve into the world of Simple Linear Regression (SLR). Our journey begins by preparing the data, exploring its nuances, and understanding the relationships between variables. We aim to uncover insights into the tech industry's compensation dynamics and build a predictive model that can estimate total yearly compensation based on years of experience. Through careful examination and statistical analysis, we shed light on the correlation between these factors and unveil the model's performance in predicting compensation. Join us as we navigate the intricacies of SLR to gain valuable insights that benefit HR professionals, hiring managers, and tech industry employees.*



#### **1. Data Preparation**

*The initial step involved loading the necessary packages and importing the dataset, "tech_salary_dataset.csv," into an R data frame named "salary.df." This ensured that our data was ready for further analysis.*

```{r}
# Load dataset
salary.df <- read.csv("tech_salary_dataset.csv")

head(salary.df)  # Return first few rows 
```



#### **Data Exploration**

*Our data exploration phase included an examination of the dataset's structure, identifying both numeric and categorical variables. This section provided an overview of the dataset's key characteristics.*

```{r}
# View structure of dataset
str(salary.df)
```
The numerical variables in the dataset include: total yearly compensation, total years of experience, years at the company, base salary, stock grant value and bonus. 

On the other hand, the categorical variables in the dataset include: timestamp, company, level, title, location, tag, gender, other details, cityid, dmaid, row number, masters degree, bachelors degree, doctorate degree, highschool, some college, race, and education.



#### **Data Partitioning**

*Before conducting deeper analysis, the data was partitioned into training and validation sets. This crucial step ensures the model's relevance and accuracy.*

```{r}
# Set seed for replicability of sample/output
set.seed(80)  

# Sample 60% of dataset & assign it to index for training set
train.index <- sample(row.names(salary.df), 0.6*dim(salary.df)[1])

# Assign remaining rows to index for validation set
valid.index <- setdiff(row.names(salary.df), train.index)  

# Subset all of the rows for all the columns associated with the training data/validation data indexes
train.df <- salary.df[train.index, ]
valid.df <- salary.df[valid.index, ]
```


It is important to partition the data before doing any sort of in-depth analysis of the variables in order to ensure the model is relevant to the data (i.e., the training data, which is used to fit the model); and performs accurately in classifying/predicting new records (i.e., using validation data to ensure the model was not "overfit" to the training data). 



#### **EDA - Data Visualization + Summary Stats**

*Exploratory data analysis involved data visualization and summary statistics. We created scatterplots, calculated correlations, and interpreted the relationships between variables.*

```{r message= FALSE}
# Create a scatterplot of yearsofexperience vs. totalyearlycompensation
ggplot(data= train.df, aes(x= yearsofexperience, y= totalyearlycompensation)) + geom_point() + labs(title = "Yearly Compensation vs. Years of Experience", x= "Years of Experience", y= "Total Yearly Compensation") + geom_smooth(method = "lm")

# the geom_smooth() function allows us to add a best-fit line
```


This plot suggests that the two variables have a moderate positive correlation. This is evident based on the close proximity of the points to the best-fit line (i.e., given a few potential outliers) and suggests that total yearly compensation will increase as a tech employee's number of years of experience in the industry increases. Ultimately, this relationship makes intuitive sense to me because typically the longer you stay in a certain industry (e.g., in a certain position or at a certain company), the more knowledgeable/qualified you will be. As a consequence, more likely you are to receive a higher compensation in the following year/period -- due to experience. Likewise, it is not usually common for an employee's salary to decrease over time, especially when they are holding the same position/gaining more experience over time.



```{r}
# Calculate the correlation between yearsofexperience & totalyearlycompensation
cor(x= train.df$yearsofexperience, y= train.df$totalyearlycompensation)
```


```{r}
# Compute significance of correlation
cor.test(x= train.df$yearsofexperience, y= train.df$totalyearlycompensation)
```


The correlation observed between "yearsofexperience" and "totalyearlycompensation" in the training dataset is approximately 0.4263. This value suggests a moderate positive correlation, indicating that as years of experience increase, total yearly compensation tends to rise as well.

When subjected to statistical analysis using Pearson's correlation test, the results confirm the significance of this relationship. The remarkably low p-value (less than 2.2e-16) unequivocally demonstrates that the observed correlation is highly meaningful. It provides compelling evidence that the connection between years of experience and total yearly compensation is not merely a random occurrence.

Furthermore, the 95 percent confidence interval for the correlation, ranging from 0.4180 to 0.4346, narrows down the likely range within which the true correlation resides. This not only reinforces the concept of a substantial and consistent link between these two variables but also enhances our confidence in the findings.

While the correlation may not be exceptionally strong, it is statistically significant and highlights a meaningful connection within the dataset. In essence, as individuals accumulate more years of experience, their total yearly compensation tends to exhibit a discernible upward trend, as substantiated by the data and statistical analysis.



#### **Linear Regression Model**

*In this next step a simple linear regression model is constructed to predict total yearly compensation based on years of experience. The model's summary and a hypothetical input prediction were included to illustrate its functionality.*


```{r}
# Create SLR model
model <- lm(totalyearlycompensation ~ yearsofexperience, data = train.df)

# Generate model summary
summary(model)
```



#### **Model Evaluation**

*In the following section, we evaluated the model's performance using residuals analysis, highlighting instances where the model's predictions deviated significantly from actual values.*

```{r}
# Generate summary of model residuals
summary(model$residuals)
```
In this model, the minimum residual value is -744,515 & the maximum residual value is 4,663,543. 



```{r}
# Create df with fitted values (predictions) & residuals
residual.df <- data.frame("Residuals" = model$residuals, "Predicted Total Comp" = model$fitted.values)

# Bind the 'residual.df' to the df with our training data
train.df <- cbind(train.df, residual.df)

# Returns first few rows of data to ensure successful binding of the 2 dataframes
head(train.df)  
```

```{r}
# Identify the observation whose rating generated the highest residual value
subset(train.df, Residuals == max(train.df$Residuals))
```
Based on the observation whose rating generated the highest residual value, the person's actual total compensation was 4,980,000 (dollars); while the model predicted it would be 316,456.80 (dollars). The residual is calculated by taking the difference between the actual total compensation for each employee and what the regression model predicted it to be (i.e., 4980000 - 316456.80 = 4663543.30). 



```{r}
# Identify the observation whose rating generated the lowest residual value
subset(train.df, Residuals == min(train.df$Residuals))
```
Based on the observation whose rating generated the lowest residual value, the person's actual total compensation was 102,000 (dollars); while the model predicted it would be 846,515.10 (dollars). The residual is calculated by taking the difference between the actual total compensation for each employee and what the regression model predicted it to be (i.e., 102000 - 846515.10 = -744,515.10). 


It looks like there are some cases where this model is quite a bit "off the mark". Although intuitively it makes sense that years of experience could be a factor in the total compensation an employee in the high tech industry receives, there a few possible reasons why years of experience might not closely predict total compensation. Specifically, it is possible that "years of experience" is correlated with one or more of the other input variables in the model, which exposes the model to the dangers of multi-collinearity. There is also the case where variable reduction is the focus; in which there are just "better" input variables that act as predictors for total compensation -- with years of experience being one of the potential variables under the scope for elimination/being "dropped." For example, based on the description of each of the variables in the original dataset, we know that total yearly compensation is the sum of several other numerical values - including base salary, bonus, & stock grant value.



#### **Summary of Linear Regression Equation**

```{r}
summary(model)  # Return model summary to obtain coefficients (m, slope & b, y-intercept)
```
Therefore, the regression equation generated by the model is: Total yearly compensation (y) = 10193.4(x) + 143168.5


Make up a hypothetical input value and use it to make a new prediction.
```{r}
new_input <- 10

total_yearly_compensation <- function(x){
  return(10193.4*x + 143168.5)
}
total_yearly_compensation(10)
```
In sum, '10' was used as the hypothetical input value for "yearsofexperience" in the regression equation to predict as an outcome, the total yearly compensation of an employee with 10 years of experience in the high tech industry. As such, for an employee who has 10 years of experience in the high tech industry, the predicted total yearly compensation is 245,102.50 (dollars).


Training set:
```{r}
accuracy(train.df$Predicted.Total.Comp, train.df$totalyearlycompensation)
```

Validation Set:
```{r}
salary.pred <- predict(model, valid.df)
accuracy(salary.pred, valid.df$totalyearlycompensation)
```
The purpose of making this comparison is to ensure the model performs well in predicting new records (i.e., evaluating its predictive accuracy). This means that the model is not overfit to the predictor values in the training set (i.e., when applied to new, unknown records). 

In particular, the root-mean-squared error (RMSE) tells us about the "error magnitude" in units that correspond to the output (e.g., total yearly compensation); while the mean absolute error (MAE) tells us about the magnitude of the average absolute error. These prediction accuracy measures corresponding to the training data tell us about model fit when comparing the predicted/actual values of total yearly compensation; while the prediction accuracy measures that correspond to the validation set communicate the model's ability to predict new records that are unfamiliar to the model/training data. In general, RMSE/MAE should be lower for the training data than the validation data, as the training data was used to fit the regression model (i.e., with low error magnitude indicating better fit).

However, even with low RMSE and MAE on training data, bias may persist - systematic errors that consistently push predictions in one direction. To detect bias, the RMSE and MAE are evaluated on the validation set. If they significantly differ from training data, it may indicate bias. In essence, this comparison ensures the model's generalization and detects potential bias that might not be apparent solely from training data.



#### **Model Comparison**

*Finally, we compared the model's root-mean-squared error (RMSE) to the standard deviation of total compensation in the training set, providing insights into its performance.*

```{r}
sd(train.df$totalyearlycompensation)
```

The model's RMSE is less than the standard deviation of total compensation in the training set. Comparing our model's RMSE to the standard deviation of total compensation reveals key insights into its performance. 

In essence, when the RMSE exceeds or equals the standard deviation, it signals that our model's predictions are no better than a "null" model that merely guesses the mean. This suggests the model isn't effectively learning from the input features and lacks meaningful predictive power.

Conversely, a lower RMSE compared to the standard deviation indicates that the model is outperforming this basic baseline. It demonstrates the model's ability to capture valuable patterns and make more accurate predictions.

In summary, this comparison not only gauges predictive accuracy but also highlights whether the model adds value beyond a simplistic approach.



#### **Conclusions (SLR)**

In this section, we distill the key insights gained from our analysis and discuss their relevance to various stakeholders. These insights shed light on the relationship between years of experience and total yearly compensation in the tech industry and offer practical implications for decision-makers.

##### **Key Insights**

* *Moderate Positive Correlation*: Our analysis revealed a moderate positive correlation between years of experience and total yearly compensation, indicating that, on average, employees earn higher compensation as they gain more experience.

* *Statistical Significance*: Statistical tests confirmed the significance of this correlation, providing robust evidence that the relationship is not random.

* *Linear Regression Model*: The linear regression model provided a quantitative representation of the relationship, facilitating salary predictions based on years of experience.

* *Model Performance*: Our model demonstrated good performance in predicting compensation, as indicated by low RMSE and MAE values on both training and validation datasets.

* *Bias Detection*: While generally performing well, the model exhibited bias in some cases, highlighting areas for potential improvement.

* *Model Value*: Comparing the RMSE to the standard deviation demonstrated that the model adds value by making more accurate predictions than a mean-based model.

##### **Stakeholder Implications**

* *HR Departments*: HR professionals can utilize the linear regression model to estimate compensation for potential hires, aiding in competitive job offers.

* *Hiring Managers*: Understanding the relationship between experience and compensation equips hiring managers to make informed decisions during the hiring process.

* *Employees*: Current employees can gain insights into their potential career trajectories and financial goals based on how their compensation may evolve.

* *Data Analysts*: The provided code and process serve as a template for data analysts and data scientists to conduct similar analyses in their organizations, exploring relationships between variables and building predictive models.

In conclusion, this analysis provides valuable insights into tech industry compensation dynamics and offers a practical model for predicting salaries based on experience, benefiting a range of stakeholders in decision-making roles.





# **PART II - K-NEAREST NEIGHBORS**

*In the second part of our analysis, we venture into the realm of K-Nearest Neighbors (k-NN) modeling. Our exploration begins by selecting a song from the Spotify Top 200 Weekly Charts of 2022 and dissecting its attributes. We leverage k-NN to predict whether this song will be liked by a user, George, and uncover its nearest musical companions. Along the way, we optimize the k-value for our model, reduce dimensionality for efficiency, and normalize the data for equitable comparisons. This section not only showcases the power of k-NN in music recommendation but also serves as a blueprint for data analysts and stakeholders aiming to enhance user experiences and decision-making in the dynamic world of music data analysis. Join us as we harmonize data science and music to predict likability and uncover the secrets of k-NN.*



#### **Load Data**

*The code begins by loading the 'spotify_top_charts_22.csv' dataset, which contains information about songs that appeared on the Top 200 Weekly (Global) charts of Spotify in 2022. Each row represents a specific song.*

```{r}
top.charts.df <- read.csv("spotify_top_charts_22.csv")

top.charts.df 
```



#### **Song Selection**

*First, we will select a song for analysis.*

I chose the song "Cold Heart" by Elton John, Dua Lipa, & PNAU (row 6).
```{r}
top.charts.df[6, ]  # returns the row for the song "Cold Heart"
```



#### **Song Information**

*Various attributes of the selected song are queried, such as danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, tempo, and duration_ms.*

  danceability:
```{r}
top.charts.df[6, "danceability"]
```
  
  energy:
```{r}
top.charts.df[6, "energy"]
```
  

  loudness:
```{r}
top.charts.df[6, "loudness"]
```
  

  speechiness:
```{r}
top.charts.df[6, "speechiness"]
```
  

  acousticness:
```{r}
top.charts.df[6, "acousticness"]
```
  

  instrumentalness:
```{r}
options(scipen = 999)
top.charts.df[6, "instrumentalness"]
```
  

  liveness:
```{r}
top.charts.df[6, "liveness"]
```
  

  tempo:
```{r}
top.charts.df[6, "tempo"]
```
  

  duration_ms:
```{r}
top.charts.df[6, "duration_ms"]
```
  


#### **Data Exploration**

*In this step, we explore the datasets & examine the target variable.*

```{r}
# Subset selected song
song.df <- top.charts.df[6, ]

# Examine structure of dataset
str(song.df)  
```


#### **Read Additional Dataset**

*A second dataset, 'spotify_data.csv,' is read into the environment, and its structure is examined. It contains ratings of various songs, with the 'target' variable indicating whether George liked the song (1 for liked, 0 for disliked).*

```{r}
# Read dataset containing George's ratings of various songs into environment
spotify.df <- read.csv("spotify_data.csv")

# Examine structure of dataset
str(spotify.df)
```



The variable 'target' is a categorical variable as it tells us whether or not George liked the song (i.e., 1 = yes he liked the song, 0 = he did not like the song). Since the categories are communicated using binary dummy variables, the data structure is integer or numeric form. 


If target is not currently a factor, convert it into a factor. It will be our response variable in this model. Target tells us whether George, the person who uploaded this dataset, liked the song. “1” means that George liked it, and “0” means that he did not.
```{r}
spotify.df$target <- as.factor(spotify.df$target)
```



```{r}
# Return table with unique values & the total count of observations corresponding to each of those unique values
table(spotify.df$target)  
```
The unique values are "0" (i.e., George did not like the song) & "1" (i.e., George did like the song). There are 997 songs he disliked and 1020 that he liked.



#### **Handling Missing Values**

*Next, we check for missing values in the dataset and find that there are none.*

```{r}
cat("There are", sum(is.na(spotify.df)), "missing values in the dataset.")
```


#### **Dimension Reduction**

*Unnecessary columns (e.g., 'key', 'mode', 'time_signature') are removed from the dataset, aiming to simplify the k-nearest neighbors (k-NN) model and improve its efficiency.*

```{r}
# Remove unnecessary columns (e.g., 'key', 'mode', 'time_signature')
spotify.df <- spotify.df[ , -c(7, 10, 13)]

head(spotify.df)  # Print first few rows to ensure columns were removed
```



#### **Data Partitioning**

*The dataset is split into training and validation sets to train and evaluate the k-NN model.*

```{r}
# Set seed for replicability
set.seed(80)

# Partition dataset into training & validation sets
train.index <- sample(c(1:nrow(spotify.df)), nrow(spotify.df)*0.6)
train.df <- spotify.df[train.index, ]
valid.df <- spotify.df[-train.index, ]
```



#### **Feature Selection**

*The code calculates the mean values of various attributes for songs George liked and disliked. It identifies variables showing a percentage difference of 10% or more between the two groups and removes those variables from the training data.* 

```{r}
train.df.filtered <- train.df %>% group_by(target) %>% summarize(mean(danceability), mean(energy), mean(loudness), mean(speechiness), mean(acousticness), mean(instrumentalness), mean(liveness), mean(tempo), mean(duration_ms))
```



*Identify which variables show a percentage difference of 10% or more? If any variables show less than 10% difference in mean value between the two groups (songs that George likes, and songs that George doesn’t like), then remove those variables entirely.*

```{r}
 # Use 'for' loop to create percent_change object
for (i in 2:ncol(train.df.filtered)) {
  percent_change <- ((train.df.filtered[1, c(2:10)] - train.df.filtered[2, c(2:10)])/train.df.filtered[1, c(2:10)])*100
} 

percent_change  # print percentage difference output
```

```{r}
# Return columns (variables) of the 'percent_change' object, where the percentage difference is 10% or more
which(abs(percent_change[ , ]) >= 10) 
```
The variables "speechiness" (column 4), "acousticness" (column 5), & "instrumentalness" (column 6) show a percentage difference of 10% or more.


```{r}
# List variable names in training data so we can remove these 3 variables
names(train.df) 
```

```{r}
train.df <- train.df[ , -c(2, 6, 9)]
```



It is reasonable to consider removing variables with highly similar values in a k-nearest neighbors (k-NN) model because doing so helps create a more parsimonious model. By reducing the number of predictor variables, we aim to improve the model's ability to generalize and make accurate predictions when faced with new data. 

Moreover, when we have an excess of predictor variables, the model requires a larger number of training records to establish meaningful patterns. Consequently, the time it takes to identify the k-nearest neighbors for new records also increases. 

Therefore, striving for a simpler, more concise model can significantly boost its confidence and efficiency when confronted with real-world data.



#### **Data Normalization**

*The data is normalized using the preProcess function. This step is crucial for k-NN algorithms, as it ensures that all variables contribute equally to distance calculations.*

```{r}
norm.values <- preProcess(train.df[ , 2:7], method = c("center", "scale"))  

# only used values in columns number 2-8 (i.e., for numeric variables only)


# Initializes normalized df's to original df
train.norm.df <- train.df
valid.norm.df <- valid.df
spotify.norm.df <- spotify.df

train.norm.df[ , 2:7] <- predict(norm.values, train.df[ , 2:7])
valid.norm.df[ , c(3:5, 7:8, 10)] <- predict(norm.values, valid.df[ , c(3:5, 7:8, 10)])
spotify.norm.df[ , c(3:5, 7:8, 10)] <- predict(norm.values, spotify.df[ , c(3:5, 7:8, 10)])


song.norm.df <- predict(norm.values, song.df[, c(6:7, 9, 14:15, 17)])
```



#### **Model Building**

*Using a k-value of 7, the code generates a predicted classification for the song "Cold Heart." It also identifies the song's 7 nearest neighbors.*

```{r}
nn <- knn(train = train.norm.df[ , 2:7], test = song.norm.df, cl = train.norm.df[ , 9], k = 7)

nn
```
The model predicted George would like the song "Cold Heart." In other words, the model classified the song "Cold Heart" as belonging to the "1" (e.g., songs George will like) class. 

The titles, artists, & outcome classes of the k-nearest neighbors are as follows
```{r}
knn_output <- train.norm.df[c(727, 170,  380,  729,   41,   50, 1171), 9:11]
knn_output
```



#### **Model Evaluation**

*The code evaluates the model's performance for different k-values and determines the optimal k-value, which is found to be 6.*

```{r}
# Initialize df with columns for 'k' & 'accuracy'
accuracy.df <- data.frame(k = seq(1, 20, 1), accuracy = rep(0, 20))

for (i in 1:20) {
  knn.pred <- knn(train.norm.df[ , 2:7], valid.norm.df[ , c(3:5, 7:8, 10)], cl = train.norm.df[ , 9], k = i)
  accuracy.df[i, 2] <- confusionMatrix(knn.pred, valid.norm.df[ , 12])$overall[1]
}

accuracy.df 
```
The optimal 'k' value is 6 because it has the maximum accuracy rate (and minimum error rate) before the accuracy rate starts to decrease again at k=7 & k=8. Though we see that the accuracy rate starts to increase again at k=9, we must consider the bias-variance tradeoff: as k increases, the model starts to approximate the training set and would predict the majority class in the data set for all case scenarios.



#### **Visualization**

*A scatterplot is created to visualize the relationship between k-values and accuracy.*

```{r}
library(ggplot2)

ggplot(data = accuracy.df, aes(x= k, y= accuracy)) + geom_point() + labs(title = "Accuracy vs. K", x= "K", y= "Accuracy")
```



#### **Final Model & Prediction**

*The k-NN model is rerun with the optimal k-value, and predictions are made for the selected song. The model predicts that George will like the song "Cold Heart," with the nearest neighbors identified.*

```{r}
nn.new <- knn(train = spotify.norm.df[ , c(3:5, 7:8, 10)], test = song.norm.df, cl = spotify.norm.df[ , 12], k = 6)

nn.new
```

The model predicted George would like the song "Cold Heart." In other words, the model classified the song "Cold Heart" as belonging to the "1" (e.g., songs George will like) class. 


```{r}
knn_new_output <- spotify.norm.df[c(454,  163,  520,  372,  474, 1398), 12:14]
knn_new_output
```

The final result is the same as when I first ran the k-nn function, but some of the "k-nearest neighbors" changed. The songs "Forever", "Can I  Kick It?", "Mamacita", & "ê·\200ê°\200 Return Home" were four of seven k-nearest neighbors in the first example, while "Do It Roger" and "Bizness" are new k-nearest neighbors (e.g., the first iteration of knn included 2 songs that differ from the latter 2, and 1 more additional song). 



#### **Conclusions (k-nn)**

In the following section, we delve into the key insights gleaned from the analysis and explore how these insights can be leveraged by various stakeholders. From predicting song likability to optimizing model parameters, these findings offer valuable guidance for enhancing decision-making and user experiences in the dynamic realm of music and data analysis. Let's uncover the implications of this analysis for different stakeholders in the music industry and beyond.

##### **Key Insights**

* The k-NN model predicts that George will like the song "Cold Heart" by Elton John, Dua Lipa, and PNAU.

* The model identified the nearest neighbors of the song, which includes songs such as "Do It Roger," "Forever," and "Mamacita."

* The optimal k-value for the model is determined to be 6, based on evaluation results.

* Dimension reduction was performed by removing variables with low impact on the model, which can lead to a more efficient and interpretable model.

* Data normalization was applied to ensure that all variables contribute equally to the k-NN distance calculations.

##### **Stakeholder Use**

* *Music Recommendation Platforms*: Music recommendation platforms like Spotify can use this analysis to enhance their recommendation algorithms. By predicting whether a user will like a song based on their listening history and song attributes, they can improve user engagement.

* *Record Labels*: Record labels can benefit from understanding which attributes of a song contribute to its likability. This information can inform their music production and promotion strategies.

* *Artists*: Artists can gain insights into what aspects of their music are more likely to resonate with their audience. This knowledge can help them create music that aligns with their fans' preferences.

* *Data Analysts*: Data analysts and data scientists can use this analysis as a template for building predictive models on similar datasets. The code and process serve as a practical example of data preprocessing, feature selection, and model evaluation.

* *Music Researchers*: Researchers in the field of music analysis and psychology can use the findings to study the relationship between song attributes and listener preferences, contributing to the understanding of music psychology.

In summary, the analysis provides valuable insights into predicting song likability and can be utilized by various stakeholders in the music industry and beyond to enhance decision-making and user experiences.


