PREDICTING THE POPULARITY OF NEWS STORIES - MITx15.071x The Analytics Edge

Introduction

Newspapers and online news aggregators like Google News need to prioritize news stories to determine which will be the most popular. In this problem, you will predict the popularity of a set of New York Times articles containing the words “Google”, “Microsoft”, or “Yahoo” from the time period May 2012-December 2013. The dependent variable in this problem is the variable popular, which labels if an article had 100 or more comments in its online comment section. The independent variables consist of a number of pieces of article metadata available at the time of publication:

# setwd('Analytics/Final/Problem2') Read data
articles <- read.csv("nytimes.csv", stringsAsFactors = FALSE)

# What proportion of articles had at least 100 comments?
sum(articles$popular)/nrow(articles)
## [1] 0.1079

# What is the correlation between the number of characters in an article's
# headline and whether the popular flag is set?
cor(nchar(articles$headline), articles$popular)
## [1] -0.1127

# Convert the 'popular' and 'type' variables to be factor variables with the
# as.factor() function.
articles$popular <- as.factor(articles$popular)
articles$type <- as.factor(articles$type)

# Split data into training and test sets Set seed
set.seed(144)
library(caTools)
spl <- sample.split(articles$popular, SplitRatio = 0.7)
train <- subset(articles, spl == T)
test <- subset(articles, spl == F)

# Train a logistic regression model
glm1 <- glm(popular ~ print + type + word.count, data = train, family = "binomial")
summary(glm1)
## 
## Call:
## glm(formula = popular ~ print + type + word.count, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.177  -0.457  -0.427  -0.285   2.558  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.507557   0.501969   -5.00  5.9e-07 ***
## print       -0.846833   0.306755   -2.76   0.0058 ** 
## typeNews     0.905593   0.403253    2.25   0.0247 *  
## typeOther    0.944076   0.605789    1.56   0.1191    
## word.count   0.000260   0.000106    2.45   0.0143 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 468.37  on 681  degrees of freedom
## Residual deviance: 432.05  on 677  degrees of freedom
## AIC: 442.1
## 
## Number of Fisher Scoring iterations: 5

# Consider an article that was printed in the newspaper (print = 1) with
# type = 'News' and a total word count of 682. What is the predicted
# probability of this observation being popular, according to this model?
exp(sum(glm1$coef * c(1, 1, 1, 0, 682)))/(exp(sum(glm1$coef * c(1, 1, 1, 0, 
    682))) + 1)
## [1] 0.09351

The following is an illustration of how we derived the above expression for the probability of an observation being popular.

The logit function is given by: \[ \log(\frac{p}{1-p})=\exp(\beta x) \] \[ \implies \frac{p}{1-p} = \exp(\beta x) \] \[ \implies p = \exp(\beta x) - p\exp(\beta x) \] \[ \implies p+p\exp(\beta x) = \exp(\beta x) \] \[ \implies p(1+\exp(\beta x)) = \exp(\beta x) \] \[ \implies p = \frac{\exp(\beta x)}{1+\exp(\beta x)} \]

The coefficients of the model are the log odds associated with that variable; so we see that the odds of being popular are exp(-0.8468333)=0.4287706 those of an otherwise identical non-print article. This means the print article is predicted to have 57.1% lower odds of being popular.

We can obtain test-set predictions using the predict function. If you then summarize your predictions, you can see that the maximum predicted probability of being popular is 0.488, so no observations will be labeled as popular using a threshold of 0.5. As a result, the logistic regression predictions exactly coincide with the predictions of the naive baseline model.

predTest <- predict(glm1, newdata = test, type = "response")
max(predTest)
## [1] 0.488

# AUC
library(ROCR)
## Loading required package: gplots
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
ROCRpred = prediction(predTest, test$popular)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.7854

The AUC is the proportion of time the model can differentiate between a randomly selected true positive and true negative. Check the following site for more illustration. The ROC curve plots the true and false positive rates for all cutoffs between 0 and 1.

perf <- performance(ROCRpred, "tpr", "fpr")
plot(perf, colorize = TRUE)

plot of chunk unnamed-chunk-3

For the above plot, check the following stackoverflow post.

10-fold Cross-Validation

In 10-fold cross validation, the model with each parameter setting will be trained on 10 90% subsets of the training set. Assume that the model is trained to select between 3 different pararmeters. Then, a total of 30 models will be trained. The models are evaluated in each case on the last 10% of the training set (not on the testing set).

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
library(rpart)
library(rpart.plot)

set.seed(144)
# Define cross-validation experiment
fitControl = trainControl(method = "cv", number = 10)
cartGrid = expand.grid(.cp = (1:50) * 0.01)
train(popular ~ type + print + word.count, data = train, method = "rpart", trControl = fitControl, 
    tuneGrid = cartGrid)
## CART 
## 
## 682 samples
##   5 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 614, 614, 614, 613, 614, 614, ... 
## 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.01  0.9       0      0.007        0       
##   0.02  0.9       0      0.007        0       
##   0.03  0.9       0      0.007        0       
##   0.04  0.9       0      0.007        0       
##   0.05  0.9       0      0.007        0       
##   0.06  0.9       0      0.007        0       
##   0.07  0.9       0      0.007        0       
##   0.08  0.9       0      0.007        0       
##   0.09  0.9       0      0.007        0       
##   0.1   0.9       0      0.007        0       
##   0.1   0.9       0      0.007        0       
##   0.1   0.9       0      0.007        0       
##   0.1   0.9       0      0.007        0       
##   0.1   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.2   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.3   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.4   0.9       0      0.007        0       
##   0.5   0.9       0      0.007        0       
##   0.5   0.9       0      0.007        0       
##   0.5   0.9       0      0.007        0       
##   0.5   0.9       0      0.007        0       
##   0.5   0.9       0      0.007        0       
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.5.

# Use cp=0.01 to build the CART model
popularTree <- rpart(popular ~ print + type + word.count, data = train, control = rpart.control(cp = 0.01))
# Plot tree prp(popularTree)

BUILDING A CORPUS FROM ARTICLE SNIPPETS

In the last part of this problem, we will determine if text analytics can be used to improve the quality of predictions of which articles will be popular.

Build a corpus called “corpus” using the snippet variable from the full data frame “articles”. Using the tm_map() function, perform the following pre-processing steps on the corpus:

1) Convert all words to lowercase

2) Remove punctuation

3) Remove English stop words

4) Stem the document

library(tm)

# Build corpus
corpus <- Corpus(VectorSource(articles$snippet))

# Convert all words to lowercase
corpus <- tm_map(corpus, tolower)

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Stem the document
corpus <- tm_map(corpus, stemDocument)

# Build a document-term matrix called 'dtm' from the preprocessed corpus.
(dtm <- DocumentTermMatrix(corpus))
## A document-term matrix (973 documents, 3926 terms)
## 
## Non-/sparse entries: 13896/3806102
## Sparsity           : 100%
## Maximal term length: 19 
## Weighting          : term frequency (tf)

# Remove all terms that don't appear in at least 5% of documents in the
# corpus, storing the result in a new document term matrix called spdtm.
(spdtm = removeSparseTerms(dtm, 0.95))
## A document-term matrix (973 documents, 17 terms)
## 
## Non-/sparse entries: 1284/15257
## Sparsity           : 92%
## Maximal term length: 9 
## Weighting          : term frequency (tf)

# Convert spdtm to a data frame called articleTex
articleText = as.data.frame(as.matrix(spdtm))
# Which word stem appears the most frequently across all snippets?
which.max(colSums(articleText))
## compani 
##       4

# Add the 3 variables (print, type, word.count and popular) to the
# articleTex data frame
articleText$print <- articles$print
articleText$type <- articles$type
articleText$word.count <- articles$word.count
articleText$popular <- articles$popular

# Split data set articleTex
trainText <- subset(articleText, spl == T)
testText <- subset(articleText, spl == F)

# Build a logistic model
glmText <- glm(popular ~ ., data = articleText, family = "binomial")
summary(glmText)
## 
## Call:
## glm(formula = popular ~ ., family = "binomial", data = articleText)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.340  -0.458  -0.365  -0.278   2.643  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.114816   0.442748   -4.78  1.8e-06 ***
## busi        -0.594950   0.591548   -1.01   0.3145    
## can          0.134864   0.418127    0.32   0.7470    
## chief       -0.287514   0.726052   -0.40   0.6921    
## compani     -0.145080   0.316033   -0.46   0.6462    
## execut      -0.144799   0.632561   -0.23   0.8189    
## googl       -0.119416   0.428949   -0.28   0.7807    
## make         0.033862   0.423871    0.08   0.9363    
## microsoft   -0.124776   0.418532   -0.30   0.7656    
## new         -0.161868   0.308156   -0.53   0.5994    
## offer       -0.766250   0.718913   -1.07   0.2865    
## one         -0.143087   0.429617   -0.33   0.7391    
## said         0.497256   0.417942    1.19   0.2341    
## servic      -1.287350   0.998880   -1.29   0.1975    
## technolog   -0.320065   0.440932   -0.73   0.4679    
## will        -0.371592   0.466229   -0.80   0.4254    
## yahoo        0.421550   0.369194    1.14   0.2535    
## year         0.151844   0.489065    0.31   0.7562    
## print       -1.037138   0.266892   -3.89   0.0001 ***
## typeNews     0.693506   0.343055    2.02   0.0432 *  
## typeOther    0.911446   0.489561    1.86   0.0626 .  
## word.count   0.000292   0.000101    2.90   0.0037 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 665.79  on 972  degrees of freedom
## Residual deviance: 588.15  on 951  degrees of freedom
## AIC: 632.1
## 
## Number of Fisher Scoring iterations: 6

# Predict using testText
predTestText <- predict(glmText, newdata = testText, type = "response")
# AUC
ROCRpredText = prediction(predTestText, testText$popular)
as.numeric(performance(ROCRpredText, "auc")@y.values)
## [1] 0.7835

glmText has more variables than the base logistic regression model, but it exhibits worse test-set performance. Therefore, the model is overfitted and removing variables would improve the test-set performance.