GitHub - https://github.com/rickysoo/audiobooks
In this exercise, we build a model to classify and predict popular audiobooks on Audible.com. The data has previously been sourced and cleaned, and about 17.6% of the items have been labeled as “popular” based on predefined criteria.
Load libraries.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(broom)
Load audiobook data.
audible <- read.csv('audible_cleaned.csv')
# audible <- audible %>% sample_n(1000)
audible$Popular <- factor(audible$Popular)
Quick check on data.
str(audible)
## 'data.frame': 355109 obs. of 11 variables:
## $ ASIN : chr "0062823485" "B0052OM9XK" "B077XN66F8" "B00C75YMJ6" ...
## $ Author : chr "Paul Stanley" "Simon Pegg" "Sidney Lumet" "Phil Robertson" ...
## $ Category : chr "Arts & Entertainment" "Arts & Entertainment" "Arts & Entertainment" "Arts & Entertainment" ...
## $ Narrator : chr "Sean Pratt" "Simon Pegg" "Richard M. Davidson" "Al Robertson, Phil Robertson" ...
## $ Price : num 20.5 24.5 21 21 53 ...
## $ Rating : num 4 4.5 5 4.5 5 4.5 4.5 4.5 3 4.5 ...
## $ RatingCount : int 78 872 90 805 585 390 1398 60 4 841 ...
## $ Title : chr "Backstage Pass" "Nerd Do Well" "Making Movies" "Happy, Happy, Happy" ...
## $ Length : int 278 290 443 297 1465 471 618 463 2336 1278 ...
## $ DaysReleased: int 8521 5639 8010 6337 6399 7891 6063 8333 8521 6477 ...
## $ Popular : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
table(audible$Popular)
##
## 0 1
## 292539 62570
sum(is.na(audible))
## [1] 0
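A per-column breakdown confirms the same thing and would show where any missing values sit:
colSums(is.na(audible))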
Explore variables using histograms.
ggplot(audible, aes(x = Price, fill = Popular)) +
geom_histogram(bins = 30)
ggplot(audible, aes(x = RatingCount, fill = Popular)) +
geom_histogram(bins = 30)
RatingCount is found to be right-skewed, so it is log-transformed to make its distribution more symmetric for regression purposes.
audible$RatingCountLog <- log(audible$RatingCount)
ggplot(audible, aes(x = RatingCountLog, fill = Popular)) +
geom_histogram(bins = 30)
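One caveat: log() returns -Inf for zero counts. The cleaned data appears to contain no zero rating counts, but log1p() would be a safe alternative if it did:
# Hypothetical guard for zero counts: log1p(x) computes log(1 + x)
# audible$RatingCountLog <- log1p(audible$RatingCount)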
ggplot(audible, aes(x = Length, fill = Popular)) +
geom_histogram(bins = 30)
Length is also found to be right-skewed, so it is log-transformed as well.
audible$LengthLog <- log(audible$Length)
ggplot(audible, aes(x = LengthLog, fill = Popular)) +
geom_histogram(bins = 30)
ggplot(audible, aes(x = DaysReleased, fill = Popular)) +
geom_histogram(bins = 30)
DaysReleased is found to be skewed in the opposite direction (left-skewed), so it is reflected about its maximum and then log-transformed to make its distribution more symmetric for regression purposes.
# Reflect about the maximum (plus 1 to avoid log(0)), then log-transform
audible$DaysReleasedLog <- log(max(audible$DaysReleased) + 1 - audible$DaysReleased)
ggplot(audible, aes(x = DaysReleasedLog, fill = Popular)) +
geom_histogram(bins = 30)
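Whether the reflect-then-log transform helped can be checked numerically; a base-R sketch using the third standardised moment (the inline skewness helper is an assumption, not part of the original script):
# Quick skewness measure: third standardised moment (near 0 = symmetric)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(audible$DaysReleased)
skewness(audible$DaysReleasedLog)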
Explore the categories using a bar plot.
ggplot(audible, aes(y = Category, fill = Popular)) +
geom_bar(stat = 'count')
The self-development category is found to have a high proportion of popular titles, so a new indicator variable, SelfDevelopment, is added to let the model capture this pattern (verified below).
audible$SelfDevelopment <- ifelse(audible$Category == 'Self Development', 1, 0)
table(audible$SelfDevelopment)
##
## 0 1
## 323008 32101
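The category-level pattern behind this choice can be verified with a quick dplyr summary (a sketch; exact rates depend on the data):
# Share of popular titles within each category, most popular first
audible %>%
  group_by(Category) %>%
  summarise(PopularRate = mean(Popular == '1'), Count = n()) %>%
  arrange(desc(PopularRate))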
audible_model <- audible[c('Price', 'RatingCountLog', 'LengthLog', 'DaysReleasedLog', 'SelfDevelopment', 'Popular')]
Split the data: 80% for the training set, 20% for the test set.
set.seed(88)
samples <- createDataPartition(audible_model$Popular, p = 0.8, list = FALSE)
train.data <- audible_model[samples, ]
test.data <- audible_model[-samples, ]
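createDataPartition() stratifies on the outcome, so both splits should keep roughly the original 17.6% popular rate; a quick check:
# Class balance should be preserved in both splits
prop.table(table(train.data$Popular))
prop.table(table(test.data$Popular))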
Train a logistic regression model.
model <- glm(Popular ~ Price + RatingCountLog + LengthLog + DaysReleasedLog + SelfDevelopment, data = train.data, family = binomial)
summary(model)
##
## Call:
## glm(formula = Popular ~ Price + RatingCountLog + LengthLog +
## DaysReleasedLog + SelfDevelopment, family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3669 -0.6340 -0.4155 -0.2215 3.7484
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.235997 0.044952 71.99 <2e-16 ***
## Price -0.011608 0.000666 -17.43 <2e-16 ***
## RatingCountLog -0.439294 0.003762 -116.78 <2e-16 ***
## LengthLog -0.059022 0.005420 -10.89 <2e-16 ***
## DaysReleasedLog -0.484566 0.004506 -107.54 <2e-16 ***
## SelfDevelopment 0.396757 0.016404 24.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 264532 on 284087 degrees of freedom
## Residual deviance: 223195 on 284082 degrees of freedom
## AIC: 223207
##
## Number of Fisher Scoring iterations: 5
All variables are found to be statistically significant in this model.
The coefficients indicate how a one-unit change in each variable affects the log-odds of an item being popular. For example, if an item belongs to the self-development category, its odds of being popular are exp(0.397) ≈ 1.49 times the odds of an otherwise identical item outside the category.
Log-odds = log(p / (1 - p)), where p is the probability of an item being popular.
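Base R has this pair built in: qlogis() computes the log-odds and plogis() inverts it back to a probability; a small round trip with an illustrative p:
p <- 0.176         # illustrative: roughly the overall popular rate
qlogis(p)          # log(p / (1 - p)), the log-odds
plogis(qlogis(p))  # inverse logit recovers the probability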
coefficients <- summary(model)$coefficients
data.frame(
  'Variables' = row.names(coefficients)[2:6],
  'Coefficients' = round(coefficients[2:6, 'Estimate'], 3),
  'Odds Ratio' = paste0(round(exp(coefficients[2:6, 'Estimate']), 3), ' times')
)
## Variables Coefficients Odds.Ratio
## 1 Price -0.012 0.988 times
## 2 RatingCountLog -0.439 0.644 times
## 3 LengthLog -0.059 0.943 times
## 4 DaysReleasedLog -0.485 0.616 times
## 5 SelfDevelopment 0.397 1.487 times
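Since broom is already loaded, tidy() can produce the same odds ratios directly; a sketch (adding conf.int = TRUE would also give confidence intervals, at the cost of profiling the likelihood over this many rows):
# Exponentiated coefficients are odds ratios
tidy(model, exponentiate = TRUE)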
The odds-ratio table shows that price has minimal effect on the odds of an item being popular. A self-development title, on the other hand, multiplies the odds of being popular by about 1.49.
For the log-transformed variables, the natural log is used, so a one-unit increase in the transformed variable corresponds to multiplying the original variable (such as the rating count) by e ≈ 2.72. In this case, multiplying the rating count by e shrinks the odds of being popular to about 0.64 times their previous value.
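Because RatingCountLog is the natural log of the raw count, multiplying the count by a factor k multiplies the odds by k^beta; a quick check for k = e and the illustrative k = 2:
b <- coef(model)['RatingCountLog']
exp(b)           # odds multiplier when the rating count is multiplied by e
exp(b * log(2))  # odds multiplier when the rating count doubles (2^b)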
Check the prediction accuracy on the test set.
test_probabilities <- predict(model, test.data, type = 'response')
test_popular <- ifelse(test_probabilities > 0.5, 1, 0)
accuracy <- mean(test_popular == test.data$Popular)
accuracy
## [1] 0.8257417
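For context, the no-information baseline of always predicting “not popular” is worth computing alongside:
# Baseline accuracy: always predict the majority class '0'
mean(test.data$Popular == '0')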
Check the confusion matrix.
test_popular <- as.factor(test_popular)
CM_test <- confusionMatrix(test_popular, test.data$Popular, positive = '1')
CM_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 57071 10940
## 1 1436 1574
##
## Accuracy : 0.8257
## 95% CI : (0.8229, 0.8285)
## No Information Rate : 0.8238
## P-Value [Acc > NIR] : 0.08769
##
## Kappa : 0.1443
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.12578
## Specificity : 0.97546
## Pos Pred Value : 0.52292
## Neg Pred Value : 0.83914
## Prevalence : 0.17620
## Detection Rate : 0.02216
## Detection Prevalence : 0.04238
## Balanced Accuracy : 0.55062
##
## 'Positive' Class : 1
##
fourfoldplot(CM_test$table)
Show the performance metrics.
CM_test$overall['Accuracy']
## Accuracy
## 0.8257417
CM_test$byClass['Precision']
## Precision
## 0.5229236
CM_test$byClass['Recall']
## Recall
## 0.1257791
While the accuracy looks fine at over 82%, precision and recall are low, at about 52% and 13% respectively.
This means that of all items predicted as popular, only about 52% are truly labeled popular, and of all items labeled popular, only about 13% are correctly predicted.
table(test_popular, test.data$Popular)
##
## test_popular 0 1
## 0 57071 10940
## 1 1436 1574
prop.table(table(test_popular, test.data$Popular))
##
## test_popular 0 1
## 0 0.80357922 0.15403895
## 1 0.02021937 0.02216246
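The low recall partly reflects the 0.5 cutoff on an imbalanced outcome. Lowering the threshold trades precision for recall; a minimal sketch (0.25 is an arbitrary illustrative cutoff, not tuned here):
# Re-classify the test set with a lower cutoff to favour recall
test_popular_025 <- factor(ifelse(test_probabilities > 0.25, 1, 0), levels = c(0, 1))
confusionMatrix(test_popular_025, test.data$Popular, positive = '1')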
Make predictions on all audiobooks.
probabilities <- predict(model, audible, type = 'response')
audible$Predicted <- ifelse(probabilities > 0.5, 1, 0)
accuracy <- mean(audible$Predicted == audible$Popular)
accuracy
## [1] 0.8252959
audible$Predicted <- as.factor(audible$Predicted)
CM <- confusionMatrix(audible$Predicted, audible$Popular, positive = '1')
CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 285441 54941
## 1 7098 7629
##
## Accuracy : 0.8253
## 95% CI : (0.824, 0.8265)
## No Information Rate : 0.8238
## P-Value [Acc > NIR] : 0.009674
##
## Kappa : 0.1396
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.12193
## Specificity : 0.97574
## Pos Pred Value : 0.51803
## Neg Pred Value : 0.83859
## Prevalence : 0.17620
## Detection Rate : 0.02148
## Detection Prevalence : 0.04147
## Balanced Accuracy : 0.54883
##
## 'Positive' Class : 1
##
fourfoldplot(CM$table)
Show the performance metrics.
CM$overall['Accuracy']
## Accuracy
## 0.8252959
CM$byClass['Precision']
## Precision
## 0.5180281
CM$byClass['Recall']
## Recall
## 0.1219274
As on the test set, accuracy is over 82% while precision and recall remain low at about 52% and 12% respectively.
This is expected, since the full dataset includes the training data and the model behaves the same way on it.
table(audible$Predicted, audible$Popular)
##
## 0 1
## 0 285441 54941
## 1 7098 7629
prop.table(table(audible$Predicted, audible$Popular))
##
## 0 1
## 0 0.80381235 0.15471588
## 1 0.01998823 0.02148354
Save the model and the augmented data for future use.
saveRDS(model, 'model-LR.rds')
saveRDS(audible, 'audible-complete.rds')
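The saved model can later be reloaded and used to score new titles without retraining; a usage sketch (new_books is a hypothetical data frame containing the same predictor columns):
# Reload the fitted model and score fresh data
model <- readRDS('model-LR.rds')
# new_books must have Price, RatingCountLog, LengthLog,
# DaysReleasedLog and SelfDevelopment columns:
# predict(model, new_books, type = 'response')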