This project uses Plotly graphics with Spotify data gathered via the Spotify API and chart downloads from the Spotify top-charts site. To download the data for this project, run the DownloadTop200.R, ReadData.R, and APIpull.R scripts from my GitHub page here: https://github.com/KKONZ/SpotifyR. The directory names and locations will need to be changed, and the SpotifyData folder needs to be empty so that the downloaded chart data can be written there and rolled up into one dataset. You will also need to register a Spotify app via the Spotify developer site and replace the credentials in the APIpull.R script.
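The APIpull.R script expects your own app credentials; the snippet below is only a minimal sketch of that substitution, and the variable names are placeholders rather than the actual names used in the script.
# Placeholders only -- replace the corresponding values inside APIpull.R
# with the client ID and secret from your registered Spotify app
client_id     <- "YOUR_SPOTIFY_CLIENT_ID"
client_secret <- "YOUR_SPOTIFY_CLIENT_SECRET"
# After updating the directory paths, run the scripts in order:
# source("DownloadTop200.R"); source("ReadData.R"); source("APIpull.R")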
# Set the seed to reproduce the same results as shown
# in this script
set.seed(654321)
# Read Data
SpotifyData <- read.csv("C:/Users/karlk/Desktop/SpotifyData.csv", header = TRUE)
# Load the packages used for modeling and plotting
library(caret)
library(randomForest)
library(plotly)
library(ggplot2)  # used directly for the prediction scatter plots below
# Use the createDataPartition function to split the data into training and testing sets:
# 60% of the data will be used to train the model and 40% will be used to test it
InTrain <- createDataPartition(y=log(SpotifyData$SumChartStreams), p = .60, list=FALSE)
training <- SpotifyData[InTrain, ]
testing <- SpotifyData[-InTrain, ]
The next step is to start modeling the data. It will be advantageous to do some form of testing for feature importance. Instead of using AIC or BIC stepwise selection to determine which variables to include in the model, a random forest will be used to make those decisions. For this ensemble learner, repeated cross validation with 5 folds, repeated 5 times, will be used to avoid overfitting the importance scores.
# Set up the repeated cross validation for the random forest model that will be used to
# find the important features to use in the model
control <- trainControl(method="repeatedcv", number=5, repeats=5)
fitImportance <- train(log(SumChartStreams) ~ liveness + danceability +
energy + key + loudness + mode +
speechiness + acousticness + instrumentalness +
valence + tempo + log(as.numeric(duration_ms)) +
DateIndex + time_signature +
MaxPosition + minPosition + log(CountOnCharts) +
TrackPopularity, data = training
, method="rf", importance = TRUE,
preProcess="scale",
trControl=control)
The importance ratings are determined by calculating the MSE on the out-of-bag data for each tree, then recomputing the MSE after permuting a variable. The differences between the two errors are averaged across trees, normalized by their standard deviation, and scaled to fall between 0 and 100. Below we can see that the log count of days on the charts, the minimum position, date index, max position, and track popularity have the highest importance scores for predicting the sum of streams while the track was on the charts.
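To make the mechanics concrete, the function below is a minimal sketch of permutation importance for a single predictor on a plain data frame. It is illustrative only: randomForest performs this calculation per tree on that tree's out-of-bag sample, and the function name here is not part of the analysis.
# Sketch of permutation importance for one predictor (illustration only):
# how much the MSE increases when that predictor's values are shuffled
permutation_increase <- function(model, newdata, actual, var) {
  mse_base <- mean((actual - predict(model, newdata))^2)   # MSE with the predictor intact
  shuffled <- newdata
  shuffled[[var]] <- sample(shuffled[[var]])               # shuffling breaks its link to the outcome
  mse_perm <- mean((actual - predict(model, shuffled))^2)  # MSE with the predictor permuted
  mse_perm - mse_base                                      # a larger increase means a more important variable
}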
# Create an importance object from the random forest model
importance <- varImp(fitImportance)
importance1 <- data.frame(row.names(importance$importance), importance$importance[ ,1])
names(importance1) <- c("Feature", "OverallImportance")
importance2 <- data.frame(fitImportance$finalModel$importanceSD)
importance2 <- data.frame(row.names(importance2), importance2[ ,1])
names(importance2) <- c("Feature", "ImportanceSD")
mergeimportance <- merge(importance1, importance2, by="Feature")
mergeimportance$text <- paste("Importance SD: ", round(mergeimportance[, 3], 4), sep="")
mergeimportance <- mergeimportance[order(mergeimportance[, 2]), ]
rank <- seq(from = nrow(mergeimportance), to = 1, by = -1)
mergeimportance <- cbind(mergeimportance, rank)
mergeimportance$text <- paste("Feature: ", mergeimportance[ ,1], ", Ranked #", mergeimportance$rank, sep="")
# Display the importance scores
x <- paste(LETTERS[mergeimportance$rank])
y <- mergeimportance[ ,2]
text <- mergeimportance$text
data <- data.frame(x, y, text)
plot_ly(data, x = ~x, y = ~y, type = 'bar', text = text,
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5))) %>%
layout(title = "Random Forest Overall Importance Score",
xaxis = list(title = ""),
yaxis = list(title = "")
)
MaxPosition and TrackPopularity also score highly, but they are moderately correlated with the other top predictors. Following Occam's razor, including them would likely obscure inference without increasing the goodness of fit of the model, so they are omitted to make the final model more interpretable.
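As a quick sanity check, the pairwise correlations among the top-ranked predictors can be inspected directly. This is only a sketch and assumes these columns are stored as numeric variables in the training data:
# Correlations among the top-ranked predictors on the training data
round(cor(training[, c("MaxPosition", "minPosition", "TrackPopularity",
                       "DateIndex", "CountOnCharts")]), 2)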
# Fit an OLS model with the important features from the fitImportance random forest model
fit <- lm(log(SumChartStreams) ~
minPosition +
DateIndex +
log(CountOnCharts),
data = training)
# View the summary statistics from the OLS model created above
summary(fit)
##
## Call:
## lm(formula = log(SumChartStreams) ~ minPosition + DateIndex +
## log(CountOnCharts), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83973 -0.17969 -0.03589 0.15073 1.75074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.369e+01 3.362e-02 407.33 <2e-16 ***
## minPosition -9.372e-03 1.821e-04 -51.47 <2e-16 ***
## DateIndex 1.244e-03 4.729e-05 26.32 <2e-16 ***
## log(CountOnCharts) 9.196e-01 5.614e-03 163.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2943 on 1268 degrees of freedom
## Multiple R-squared: 0.9807, Adjusted R-squared: 0.9806
## F-statistic: 2.147e+04 on 3 and 1268 DF, p-value: < 2.2e-16
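Because the outcome is modeled on the log scale, exponentiating a coefficient gives the multiplicative change in SumChartStreams for a one-unit change in that predictor, and the log(CountOnCharts) coefficient reads as an elasticity. A brief sketch of that interpretation:
# Back-transform the coefficients to interpret them on the original scale
exp(coef(fit)["minPosition"])  # ~0.991: about 0.9% fewer total streams per one position worse minimum chart position
exp(coef(fit)["DateIndex"])    # ~1.001: about 0.1% more total streams per one-unit increase in DateIndex
# The log(CountOnCharts) coefficient of ~0.92 means a 1% longer run on the charts
# is associated with roughly a 0.92% increase in total streams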
anova(fit)
## Analysis of Variance Table
##
## Response: log(SumChartStreams)
## Df Sum Sq Mean Sq F value Pr(>F)
## minPosition 1 3244.9 3244.9 37457.74 < 2.2e-16 ***
## DateIndex 1 9.8 9.8 113.29 < 2.2e-16 ***
## log(CountOnCharts) 1 2324.0 2324.0 26827.09 < 2.2e-16 ***
## Residuals 1268 109.8 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, to get an idea of the accuracy of the model, the predicted estimates for log(SumChartStreams) are plotted against the actual observed values for both the training and testing data sets below. If the testing predictions are much further from the observed values than the training predictions, we can deduce that the model is overfitting the data.
testing$Predicted <- predict(fit, testing)
training$Predicted <- predict(fit, training)
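Before plotting, a quick numeric comparison of the error on the two sets is a useful sketch: roughly similar RMSE values for the training and testing data suggest the model is not overfitting.
# Compare root mean squared error on the training vs. testing data
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(log(training$SumChartStreams), training$Predicted)
rmse(log(testing$SumChartStreams), testing$Predicted)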
# Plot actual log count of total number of streams by predicted log total number of streams
# Shading is by DateIndex, those with higher values = older = solid
# lower values = newer = transparent
# Color is determined by the Min Position
# Size is determined by the absolute residual value
testing$Actual <- log(testing$SumChartStreams)
testing$Resid <- testing$Actual - testing$Predicted
testing$Resid <- abs(testing$Resid)
testing$HighestRank <- testing$minPosition
testing$DebutDate <- as.character(testing$Date)
#str(testing)
p <- ggplot(data = testing, aes(y = Actual, x = Predicted
, colour = HighestRank
, size = Resid
, alpha = DateIndex)) +
geom_point(aes(text = paste(" Artist: ", ArtistName, ", Track: ", Track.Name, sep = "")))
## Warning: Ignoring unknown aesthetics: text
ggplotly(p)
The testing data looks pretty good! For comparison's sake, let's check out a similar report with the training data:
training$Actual <- log(training$SumChartStreams)
training$Resid <- training$Actual - training$Predicted
training$Resid <- abs(training$Resid)
training$HighestRank <- training$minPosition
training$DebutDate <- as.character(training$Date)
#str(training)
p <- ggplot(data = training, aes(y = Actual, x = Predicted
, colour = HighestRank
, size = Resid
, alpha = DateIndex)) +
geom_point(aes(text = paste(" Artist: ", ArtistName, ", Track: ", Track.Name, sep = "")))
## Warning: Ignoring unknown aesthetics: text
ggplotly(p)