GitHub - https://github.com/rickysoo/audiobooks
In this exercise, we build a model to classify and predict popular audiobooks on Audible.com. The data has previously been sourced and cleaned, and about 17.6% of the items have been labeled as “popular” based on predefined criteria.
Load libraries.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(broom)
Load audiobook data.
audible <- read.csv('audible_cleaned.csv')
# audible <- audible %>% sample_n(1000)
audible$Popular <- factor(audible$Popular)
Quick check on data.
str(audible)
## 'data.frame': 355109 obs. of 11 variables:
## $ ASIN : chr "0062823485" "B0052OM9XK" "B077XN66F8" "B00C75YMJ6" ...
## $ Author : chr "Paul Stanley" "Simon Pegg" "Sidney Lumet" "Phil Robertson" ...
## $ Category : chr "Arts & Entertainment" "Arts & Entertainment" "Arts & Entertainment" "Arts & Entertainment" ...
## $ Narrator : chr "Sean Pratt" "Simon Pegg" "Richard M. Davidson" "Al Robertson, Phil Robertson" ...
## $ Price : num 20.5 24.5 21 21 53 ...
## $ Rating : num 4 4.5 5 4.5 5 4.5 4.5 4.5 3 4.5 ...
## $ RatingCount : int 78 872 90 805 585 390 1398 60 4 841 ...
## $ Title : chr "Backstage Pass" "Nerd Do Well" "Making Movies" "Happy, Happy, Happy" ...
## $ Length : int 278 290 443 297 1465 471 618 463 2336 1278 ...
## $ DaysReleased: int 8521 5639 8010 6337 6399 7891 6063 8333 8521 6477 ...
## $ Popular : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
table(audible$Popular)
##
## 0 1
## 292539 62570
sum(is.na(audible))
## [1] 0
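A per-column breakdown confirms the same thing and would show where any missing values sit:
colSums(is.na(audible))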
Explore variables using histograms.
ggplot(audible, aes(x = Price, fill = Popular)) +
geom_histogram(bins = 30)
ggplot(audible, aes(x = RatingCount, fill = Popular)) +
geom_histogram(bins = 30)
RatingCount is found to be right-skewed, so it is log-transformed to make its distribution more symmetric for regression purposes.
audible$RatingCountLog <- log(audible$RatingCount)
ggplot(audible, aes(x = RatingCountLog, fill = Popular)) +
geom_histogram(bins = 30)
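One caveat: log() returns -Inf for zero counts. The cleaned data appears to contain no zero rating counts, but log1p() would be a safe alternative if it did:
# Hypothetical guard for zero counts: log1p(x) computes log(1 + x)
# audible$RatingCountLog <- log1p(audible$RatingCount)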
ggplot(audible, aes(x = Length, fill = Popular)) +
geom_histogram(bins = 30)
Length is also found to be right-skewed, so it is log-transformed as well.
audible$LengthLog <- log(audible$Length)
ggplot(audible, aes(x = LengthLog, fill = Popular)) +
geom_histogram(bins = 30)
ggplot(audible, aes(x = DaysReleased, fill = Popular)) +
geom_histogram(bins = 30)
DaysReleased is found to be skewed in the opposite direction (left-skewed), so it is reflected about its maximum and then log-transformed to make its distribution more symmetric for regression purposes.
# Reflect about the maximum (plus 1 to avoid log(0)), then log-transform
audible$DaysReleasedLog <- log(max(audible$DaysReleased) + 1 - audible$DaysReleased)
ggplot(audible, aes(x = DaysReleasedLog, fill = Popular)) +
geom_histogram(bins = 30)
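Whether the reflect-then-log transform helped can be checked numerically; a base-R sketch using the third standardised moment (the inline skewness helper is an assumption, not part of the original script):
# Quick skewness measure: third standardised moment (near 0 = symmetric)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(audible$DaysReleased)
skewness(audible$DaysReleasedLog)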
Explore the categories using a bar plot.
ggplot(audible, aes(y = Category, fill = Popular)) +
geom_bar(stat = 'count')
The self-development category is found to have a high proportion of popular titles, so a new indicator variable, SelfDevelopment, is added to let the model capture this pattern (verified below).
audible$SelfDevelopment <- ifelse(audible$Category == 'Self Development', 1, 0)
table(audible$SelfDevelopment)
##
## 0 1
## 323008 32101
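The category-level pattern behind this choice can be verified with a quick dplyr summary (a sketch; exact rates depend on the data):
# Share of popular titles within each category, most popular first
audible %>%
  group_by(Category) %>%
  summarise(PopularRate = mean(Popular == '1'), Count = n()) %>%
  arrange(desc(PopularRate))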
audible_model <- audible[c('Price', 'RatingCountLog', 'LengthLog', 'DaysReleasedLog', 'SelfDevelopment', 'Popular')]
Split the data: 80% for the training set, 20% for the test set.
set.seed(88)
samples <- createDataPartition(audible_model$Popular, p = 0.8, list = FALSE)
train.data <- audible_model[samples, ]
test.data <- audible_model[-samples, ]
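createDataPartition() stratifies on the outcome, so both splits should keep roughly the original 17.6% popular rate; a quick check:
# Class balance should be preserved in both splits
prop.table(table(train.data$Popular))
prop.table(table(test.data$Popular))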
Train a logistic regression model.
model <- glm(Popular ~ Price + RatingCountLog + LengthLog + DaysReleasedLog + SelfDevelopment, data = train.data, family = binomial)
summary(model)
##
## Call:
## glm(formula = Popular ~ Price + RatingCountLog + LengthLog +
## DaysReleasedLog + SelfDevelopment, family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3669 -0.6340 -0.4155 -0.2215 3.7484
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.235997 0.044952 71.99 <2e-16 ***
## Price -0.011608 0.000666 -17.43 <2e-16 ***
## RatingCountLog -0.439294 0.003762 -116.78 <2e-16 ***
## LengthLog -0.059022 0.005420 -10.89 <2e-16 ***
## DaysReleasedLog -0.484566 0.004506 -107.54 <2e-16 ***
## SelfDevelopment 0.396757 0.016404 24.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 264532 on 284087 degrees of freedom
## Residual deviance: 223195 on 284082 degrees of freedom
## AIC: 223207
##
## Number of Fisher Scoring iterations: 5
All variables are found to be statistically significant in this model.
The coefficients indicate how a one-unit change in each variable affects the log-odds of an item being popular. For example, if an item belongs to the self-development category, its odds of being popular are exp(0.397) ≈ 1.49 times the odds of an otherwise identical item outside the category.
Log-odds = log(p / (1 - p)), where p is the probability of an item being popular.
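Base R has this pair built in: qlogis() computes the log-odds and plogis() inverts it back to a probability; a small round trip with an illustrative p:
p <- 0.176         # illustrative: roughly the overall popular rate
qlogis(p)          # log(p / (1 - p)), the log-odds
plogis(qlogis(p))  # inverse logit recovers the probability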
coefficients <- summary(model)$coefficients
data.frame(
  'Variables' = row.names(coefficients)[2:6],
  'Coefficients' = round(coefficients[2:6, 'Estimate'], 3),
  'Odds Ratio' = paste0(round(exp(coefficients[2:6, 'Estimate']), 3), ' times')
)
## Variables Coefficients Odds.Ratio
## 1 Price -0.012 0.988 times
## 2 RatingCountLog -0.439 0.644 times
## 3 LengthLog -0.059 0.943 times
## 4 DaysReleasedLog -0.485 0.616 times
## 5 SelfDevelopment 0.397 1.487 times
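Since broom is already loaded, tidy() can produce the same odds ratios directly; a sketch (adding conf.int = TRUE would also give confidence intervals, at the cost of profiling the likelihood over this many rows):
# Exponentiated coefficients are odds ratios
tidy(model, exponentiate = TRUE)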
The odds-ratio table shows that price has minimal effect on the odds of an item being popular. A self-development title, on the other hand, multiplies the odds of being popular by about 1.49.
For the log-transformed variables, the natural log is used, so a one-unit increase in the transformed variable corresponds to multiplying the original variable (such as the rating count) by e ≈ 2.72. In this case, multiplying the rating count by e shrinks the odds of being popular to about 0.64 times their previous value.
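Because RatingCountLog is the natural log of the raw count, multiplying the count by a factor k multiplies the odds by k^beta; a quick check for k = e and the illustrative k = 2:
b <- coef(model)['RatingCountLog']
exp(b)           # odds multiplier when the rating count is multiplied by e
exp(b * log(2))  # odds multiplier when the rating count doubles (2^b)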
Check the prediction accuracy on the test set.
test_probabilities <- predict(model, test.data, type = 'response')
test_popular <- ifelse(test_probabilities > 0.5, 1, 0)
accuracy <- mean(test_popular == test.data$Popular)
accuracy
## [1] 0.8257417
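For context, the no-information baseline of always predicting “not popular” is worth computing alongside:
# Baseline accuracy: always predict the majority class '0'
mean(test.data$Popular == '0')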
Check the confusion matrix.
test_popular <- as.factor(test_popular)
CM_test <- confusionMatrix(test_popular, test.data$Popular, positive = '1')
CM_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 57071 10940
## 1 1436 1574
##
## Accuracy : 0.8257
## 95% CI : (0.8229, 0.8285)
## No Information Rate : 0.8238
## P-Value [Acc > NIR] : 0.08769
##
## Kappa : 0.1443
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.12578
## Specificity : 0.97546
## Pos Pred Value : 0.52292
## Neg Pred Value : 0.83914
## Prevalence : 0.17620
## Detection Rate : 0.02216
## Detection Prevalence : 0.04238
## Balanced Accuracy : 0.55062
##
## 'Positive' Class : 1
##
fourfoldplot(CM_test$table)
Show the performance metrics.
CM_test$overall['Accuracy']
## Accuracy
## 0.8257417
CM_test$byClass['Precision']
## Precision
## 0.5229236
CM_test$byClass['Recall']
## Recall
## 0.1257791
While the accuracy looks fine at over 82%, precision and recall are low, at about 52% and 13% respectively.
This means that of all items predicted as popular, only about 52% are truly labeled popular, and of all items labeled popular, only about 13% are correctly predicted.
table(test_popular, test.data$Popular)
##
## test_popular 0 1
## 0 57071 10940
## 1 1436 1574
prop.table(table(test_popular, test.data$Popular))
##
## test_popular 0 1
## 0 0.80357922 0.15403895
## 1 0.02021937 0.02216246
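The low recall partly reflects the 0.5 cutoff on an imbalanced outcome. Lowering the threshold trades precision for recall; a minimal sketch (0.25 is an arbitrary illustrative cutoff, not tuned here):
# Re-classify the test set with a lower cutoff to favour recall
test_popular_025 <- factor(ifelse(test_probabilities > 0.25, 1, 0), levels = c(0, 1))
confusionMatrix(test_popular_025, test.data$Popular, positive = '1')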
Make predictions on all audiobooks.
probabilities <- predict(model, audible, type = 'response')
audible$Predicted <- ifelse(probabilities > 0.5, 1, 0)
accuracy <- mean(audible$Predicted == audible$Popular)
accuracy
## [1] 0.8252959
audible$Predicted <- as.factor(audible$Predicted)
CM <- confusionMatrix(audible$Predicted, audible$Popular, positive = '1')
CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 285441 54941
## 1 7098 7629
##
## Accuracy : 0.8253
## 95% CI : (0.824, 0.8265)
## No Information Rate : 0.8238
## P-Value [Acc > NIR] : 0.009674
##
## Kappa : 0.1396
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.12193
## Specificity : 0.97574
## Pos Pred Value : 0.51803
## Neg Pred Value : 0.83859
## Prevalence : 0.17620
## Detection Rate : 0.02148
## Detection Prevalence : 0.04147
## Balanced Accuracy : 0.54883
##
## 'Positive' Class : 1
##
fourfoldplot(CM$table)
Show the performance metrics.
CM$overall['Accuracy']
## Accuracy
## 0.8252959
CM$byClass['Precision']
## Precision
## 0.5180281
CM$byClass['Recall']
## Recall
## 0.1219274
As on the test set, accuracy is over 82% while precision and recall remain low at about 52% and 12% respectively.
This is expected, since the full dataset includes the training data and the model behaves the same way on it.
table(audible$Predicted, audible$Popular)
##
## 0 1
## 0 285441 54941
## 1 7098 7629
prop.table(table(audible$Predicted, audible$Popular))
##
## 0 1
## 0 0.80381235 0.15471588
## 1 0.01998823 0.02148354
Save the model and the augmented data for future use.
saveRDS(model, 'model-LR.rds')
saveRDS(audible, 'audible-complete.rds')
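The saved model can later be reloaded and used to score new titles without retraining; a usage sketch (new_books is a hypothetical data frame containing the same predictor columns):
# Reload the fitted model and score fresh data
model <- readRDS('model-LR.rds')
# new_books must have Price, RatingCountLog, LengthLog,
# DaysReleasedLog and SelfDevelopment columns:
# predict(model, new_books, type = 'response')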