example from MITx: The Analytics Edge

the data

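The data comes from Twitter: tweets about Apple, each rated for sentiment. Avg is the average sentiment rating of a tweet; higher values are more positive.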

tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

negative sentiments

tweets$negative <- as.factor(tweets$Avg <= -1)
table(tweets$negative)
## 
## FALSE  TRUE 
##   999   182
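A minimal sketch of the labelling rule, using made-up ratings rather than the values in tweets.csv. A tweet counts as negative when its average rating is -1 or lower, so the boundary value -1 itself is labelled TRUE.

```r
# Made-up average ratings (assumption: not taken from tweets.csv).
avg <- c(2, 0.4, -1, -1.6, -0.8)

# Same rule as above: negative means an average rating of -1 or lower.
negative <- as.factor(avg <= -1)

table(negative)           # counts per class
share_negative <- mean(avg <= -1)
share_negative            # fraction of tweets labelled negative
```

In the real data above, 182 of 1181 tweets (about 15%) are negative, so the two classes are unbalanced.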

corpus

A corpus is a collection of documents. We’ll need to convert our tweets to a corpus for pre-processing.

library(tm)  # provides VCorpus, tm_map, stopwords, DocumentTermMatrix
corpus <- VCorpus(VectorSource(tweets$Tweet))
corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1181
corpus[[1]]$content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"

stemming, removing stop words

Pre-processing steps such as lowercasing, removing punctuation, removing stop words and stemming are applied to every document with the tm_map function.

corpus <- tm_map(corpus, content_transformer(tolower))
corpus[[1]]$content
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
corpus <- tm_map(corpus, removePunctuation)
corpus[[1]]$content
## [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"

stop words in tm

stopwords("english")[1:10]
##  [1] "i"         "me"        "my"        "myself"    "we"       
##  [6] "our"       "ours"      "ourselves" "you"       "your"

Remove ‘Apple’ (present in all these tweets about Apple) and all English stopwords.

corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]$content
## [1] "   say    far  best customer care service   ever received  appstore"
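The cleaning steps so far can be approximated in base R. This is only a sketch: clean_tweet is a hypothetical helper, the stop word list is a tiny made-up one (tm's stopwords("english") is much longer), and the example tweet is invented.

```r
# Hypothetical helper: a base R approximation of the tm steps above.
clean_tweet <- function(text, stop_words) {
  text <- tolower(text)                           # lowercase everything
  text <- gsub("[[:punct:]]+", "", text)          # drop punctuation, including @ and #
  words <- strsplit(text, "[[:space:]]+")[[1]]    # split on whitespace
  words <- words[nzchar(words) & !(words %in% stop_words)]  # drop stop words
  paste(words, collapse = " ")
}

# Tiny made-up stop word list, with "apple" added as in the text above.
stop_words <- c("apple", "i", "my", "you", "the", "is", "so")

cleaned <- clean_tweet("Thank you @Apple, loving my new iPhone!", stop_words)
cleaned
# "thank loving new iphone"
```

Unlike removeWords, which leaves the surrounding whitespace in place (see the output above), this sketch rejoins the remaining words with single spaces.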

stem the words

corpus <- tm_map(corpus, stemDocument)
corpus[[1]]$content
## [1] "say far best custom care servic ever receiv appstor"

bag of words in R

The tm package provides a function called DocumentTermMatrix that generates a matrix where the rows correspond to documents, in our case tweets, and the columns correspond to words in those tweets.

frequencies <- DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)
inspect(frequencies[1000:1005, 505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   cheapen cheaper check cheep cheer cheerio cherylcol chief chiiiiqu
##   1000       0       0     0     0     0       0         0     0        0
##   1001       0       0     0     0     0       0         0     0        0
##   1002       0       0     0     0     0       0         0     0        0
##   1003       0       0     0     0     0       0         0     0        0
##   1004       0       0     0     0     0       0         0     0        0
##   1005       0       0     0     0     1       0         0     0        0
##       Terms
## Docs   child
##   1000     0
##   1001     0
##   1002     0
##   1003     0
##   1004     0
##   1005     0
# documents from 1000 to 1005
# words from 505 to 515

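“cheer” appears in tweet 1005. “cheap” doesn’t appear in any of these tweets. (inspect prints only a sample of 10 of the 11 selected terms, which is why not every column is shown.)

This data is what we call sparse: almost all entries in our matrix are zero.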
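To make the bag-of-words idea concrete, here is a small base R sketch that builds a document-term matrix by hand and measures its sparsity. The three "tweets" are made up and assumed to be already cleaned and stemmed; this is not how tm stores the matrix internally.

```r
# Made-up, already cleaned and stemmed documents.
docs <- c("love new iphon", "hate new updat", "love love itun")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one character vector per document
vocab  <- sort(unique(unlist(tokens)))        # all distinct terms

# Count how often each vocabulary term occurs in each document.
counts <- vapply(
  tokens,
  function(words) tabulate(match(words, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

dtm <- t(counts)                              # rows = documents, columns = terms
dimnames(dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)
dtm

sparsity <- mean(dtm == 0)                    # share of zero entries
sparsity
```

In the full matrix above, 3,875,329 of the 3,884,309 entries are zero (about 99.8%), which tm reports as 100%.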

dealing with sparsity

What are the most popular (frequent) terms?

findFreqTerms(frequencies, lowfreq = 100)
## [1] "iphon" "itun"  "new"
# minimum number of times a term must appear to be displayed is one hundred
findFreqTerms(frequencies, lowfreq = 20)
##  [1] "android"              "anyon"                "app"                 
##  [4] "appl"                 "back"                 "batteri"             
##  [7] "better"               "buy"                  "can"                 
## [10] "cant"                 "come"                 "dont"                
## [13] "fingerprint"          "freak"                "get"                 
## [16] "googl"                "ios7"                 "ipad"                
## [19] "iphon"                "iphone5"              "iphone5c"            
## [22] "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                 
## [28] "look"                 "love"                 "make"                
## [31] "market"               "microsoft"            "need"                
## [34] "new"                  "now"                  "one"                 
## [37] "phone"                "pleas"                "promo"               
## [40] "promoipodplayerpromo" "realli"               "releas"              
## [43] "samsung"              "say"                  "store"               
## [46] "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                 
## [52] "via"                  "want"                 "well"                
## [55] "will"                 "work"
# minimum number of times a term must appear to be displayed is twenty

Only 56 of the 3289 terms in the matrix appear at least twenty times, so most terms are probably of little use to our prediction model.
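The frequency lookup can be sketched in base R as a column sum over the document-term matrix. The counts below are made up, and this is an approximation of what findFreqTerms reports, not its actual implementation.

```r
# Made-up counts: 3 documents (rows) by 3 terms (columns).
dtm <- matrix(c(1, 0, 2,
                1, 1, 0,
                1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("iphon", "itun", "new")))

term_totals <- colSums(dtm)                   # total occurrences of each term

# Terms that occur at least `lowfreq` times across all documents.
lowfreq <- 2
frequent <- names(term_totals)[term_totals >= lowfreq]
frequent
# "iphon" "new"
```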

More terms mean more independent variables, which results in longer computation time to build our models.

The ratio of independent variables to observations also affects how well the model will generalize.

sparse <- removeSparseTerms(frequencies, 0.995)

# sparsity threshold
#  if 0.98 - keep terms that appear in 2% or more tweets
#  if 0.995 - keep terms that appear in 0.5% or more tweets

sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)
tweetsSparse <- as.data.frame(as.matrix(sparse))

colnames(tweetsSparse) <- make.names(colnames(tweetsSparse))
# some words start with a number, but R struggles with variable names that start with a number

tweetsSparse$negative <- tweets$negative
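The two steps above, dropping rare terms and making the column names safe, can be sketched in base R. The matrix is made up, and the filter is a simplified stand-in for removeSparseTerms: the exact handling of terms that sit right on the threshold may differ in tm.

```r
# Made-up counts: 4 documents (rows) by 4 terms (columns).
dtm <- matrix(c(1, 1, 0, 0,
                2, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("iphon", "7evenstarz", "next", "divulg")))

# Share of documents in which each term does NOT appear.
term_sparsity <- colMeans(dtm == 0)

# Keep only terms whose sparsity is below the threshold (simplified rule).
threshold <- 0.7
kept <- dtm[, term_sparsity < threshold, drop = FALSE]

# Convert to a data frame and make the column names valid R variable names.
df <- as.data.frame(kept)
colnames(df) <- make.names(colnames(df))
colnames(df)
# "iphon" "X7evenstarz" "next."
```

make.names prefixes names that start with a digit with "X" and appends a dot to reserved words, which is why terms such as X7evenstarz, break. and next. show up in the model output further down.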

create training and test set

library(caTools)  # provides sample.split
set.seed(123)
split <- sample.split(tweetsSparse$negative, SplitRatio = 0.7)
trainSparse <- subset(tweetsSparse, split == TRUE)
testSparse <- subset(tweetsSparse, split == FALSE)
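sample.split keeps the share of each class roughly the same in both sets. A rough base R sketch of that idea follows; stratified_split is a hypothetical helper and the labels are made up, so this illustrates the technique rather than caTools' actual code.

```r
set.seed(123)

# Made-up labels: 80 non-negative and 20 negative observations.
labels <- factor(rep(c("FALSE", "TRUE"), times = c(80, 20)))

# Hypothetical helper: sample the same fraction within each class.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)                          # positions of this class
    n_train <- round(ratio * length(idx))           # how many go to training
    chosen <- idx[sample.int(length(idx), n_train)] # random positions
    in_train[chosen] <- TRUE
  }
  in_train
}

in_train <- stratified_split(labels, ratio = 0.7)
table(labels, in_train)
```

Both classes end up with 70% of their rows in the training set, so the test set keeps the same class balance as the full data.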

CART model

library(rpart)       # CART models
library(rpart.plot)  # prp() for plotting the tree
tweetCART <- rpart(negative ~ .,
                   data = trainSparse, method = "class")
prp(tweetCART)

make predictions

predictCART <- predict(tweetCART, newdata = testSparse, type = "class")
confusionmatrix <- table(testSparse$negative, predictCART)
confusionmatrix
##        predictCART
##         FALSE TRUE
##   FALSE   294    6
##   TRUE     37   18
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(testSparse))
## 
## Accuracy 0.8788732
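Accuracy alone hides how the errors are distributed. As a small worked example, the counts from the CART confusion matrix above can be typed in by hand to get sensitivity and specificity as well (the variable names here are illustrative).

```r
# Counts copied from the CART confusion matrix above.
cm <- matrix(c(294,  6,
                37, 18),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)                    # correct / all
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])     # negative tweets caught
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])  # non-negative tweets kept

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
# accuracy 0.879, sensitivity 0.327, specificity 0.980
```

The tree finds only 18 of the 55 negative tweets in the test set, even though its overall accuracy is close to 88%.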

baseline model accuracy

The baseline model always predicts the most frequent class, ‘non-negative’ sentiment.

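negativematrix <- table(testSparse$negative)
# the baseline is right for every tweet in the most frequent class
cat("\nAccuracy", max(negativematrix) / sum(negativematrix))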
## 
## Accuracy 0.8450704

random forest model

A random forest model takes much longer to build, due to the large number of independent variables.

library(randomForest)
set.seed(123)
tweetRF <- randomForest(negative ~ ., data = trainSparse)
predictRF <- predict(tweetRF, newdata = testSparse)
confusionmatrix <- table(testSparse$negative, predictRF)
confusionmatrix
##        predictRF
##         FALSE TRUE
##   FALSE   293    7
##   TRUE     34   21
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(testSparse))
## 
## Accuracy 0.884507

logistic regression model

tweetlogit <- glm(negative ~ ., data = trainSparse, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(tweetlogit)
## 
## Call:
## glm(formula = negative ~ ., family = "binomial", data = trainSparse)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -8.49    0.00    0.00    0.00    8.49  
## 
## Coefficients: (7 not defined because of singularities)
##                        Estimate Std. Error    z value Pr(>|z|)    
## (Intercept)          -1.047e+15  5.330e+06 -196471463   <2e-16 ***
## X244tsuyoponzu        1.047e+15  3.048e+07   34353213   <2e-16 ***
## X7evenstarz           2.690e+15  4.542e+07   59227701   <2e-16 ***
## actual               -7.674e+14  2.884e+07  -26611097   <2e-16 ***
## add                  -8.849e+14  6.062e+07  -14596618   <2e-16 ***
## alreadi              -2.947e+14  3.772e+07   -7812707   <2e-16 ***
## alway                -7.146e+14  4.813e+07  -14847398   <2e-16 ***
## amaz                 -6.693e+14  3.977e+07  -16829776   <2e-16 ***
## amazon               -2.370e+15  9.950e+07  -23822526   <2e-16 ***
## android               7.892e+14  2.007e+07   39317161   <2e-16 ***
## announc              -2.446e+15  2.809e+07  -87079701   <2e-16 ***
## anyon                -4.159e+14  2.607e+07  -15950953   <2e-16 ***
## app                   3.770e+14  1.556e+07   24219551   <2e-16 ***
## appl                 -5.837e+14  1.695e+07  -34427793   <2e-16 ***
## appstor              -1.635e+15  3.697e+07  -44230461   <2e-16 ***
## arent                -1.262e+15  4.287e+07  -29425117   <2e-16 ***
## ask                  -1.322e+15  3.861e+07  -34243520   <2e-16 ***
## avail                -2.055e+15  3.782e+07  -54346409   <2e-16 ***
## away                 -5.002e+13  4.471e+07   -1118735   <2e-16 ***
## awesom                8.366e+14  3.464e+07   24154252   <2e-16 ***
## back                 -1.905e+15  2.304e+07  -82683091   <2e-16 ***
## batteri              -1.136e+15  3.029e+07  -37503023   <2e-16 ***
## best                  1.069e+15  3.759e+07   28445859   <2e-16 ***
## better                3.124e+14  2.169e+07   14401591   <2e-16 ***
## big                  -2.229e+15  3.069e+07  -72636420   <2e-16 ***
## bit                   8.489e+13  2.799e+07    3033255   <2e-16 ***
## black                 1.121e+15  3.684e+07   30419212   <2e-16 ***
## blackberri           -2.069e+15  3.251e+07  -63650797   <2e-16 ***
## break.                1.791e+15  3.619e+07   49483853   <2e-16 ***
## bring                -1.517e+15  3.576e+07  -42430018   <2e-16 ***
## burberri              4.290e+14  3.593e+07   11938729   <2e-16 ***
## busi                 -3.320e+15  3.130e+07 -106072554   <2e-16 ***
## buy                   2.066e+15  2.469e+07   83699753   <2e-16 ***
## call                  1.712e+15  3.203e+07   53454609   <2e-16 ***
## can                  -6.827e+14  1.522e+07  -44842110   <2e-16 ***
## cant                  1.170e+15  2.174e+07   53816650   <2e-16 ***
## carbon               -1.773e+15  1.194e+08  -14843110   <2e-16 ***
## card                  4.728e+15  5.740e+07   82355362   <2e-16 ***
## care                  3.600e+15  3.773e+07   95400584   <2e-16 ***
## case                  1.364e+14  3.509e+07    3885327   <2e-16 ***
## cdp                   1.503e+15  8.722e+07   17233989   <2e-16 ***
## chang                -4.918e+13  3.216e+07   -1529048   <2e-16 ***
## charg                 1.025e+15  4.422e+07   23173007   <2e-16 ***
## charger              -1.167e+14  2.806e+07   -4158504   <2e-16 ***
## cheap                 6.726e+13  3.328e+07    2020913   <2e-16 ***
## china                 8.542e+14  2.605e+07   32790275   <2e-16 ***
## color                -1.923e+15  2.638e+07  -72903150   <2e-16 ***
## colour                2.047e+15  5.176e+07   39555035   <2e-16 ***
## come                 -3.326e+14  1.973e+07  -16855683   <2e-16 ***
## compani              -5.980e+14  2.785e+07  -21474343   <2e-16 ***
## condescens            3.956e+14  6.016e+07    6575506   <2e-16 ***
## condom                2.297e+15  4.130e+07   55607494   <2e-16 ***
## copi                  6.395e+14  4.424e+07   14454035   <2e-16 ***
## crack                -5.958e+14  4.425e+07  -13464583   <2e-16 ***
## creat                -3.711e+14  3.639e+07  -10199949   <2e-16 ***
## custom                3.686e+14  4.341e+07    8492670   <2e-16 ***
## darn                  7.905e+14  3.199e+07   24715387   <2e-16 ***
## data                  1.420e+15  3.652e+07   38892983   <2e-16 ***
## date                 -4.303e+14  4.424e+07   -9726799   <2e-16 ***
## day                   1.896e+15  2.777e+07   68276510   <2e-16 ***
## dear                  9.422e+14  3.169e+07   29734282   <2e-16 ***
## design               -4.603e+14  3.663e+07  -12565908   <2e-16 ***
## develop              -9.558e+14  2.670e+07  -35795214   <2e-16 ***
## devic                -6.933e+14  2.559e+07  -27089897   <2e-16 ***
## didnt                 1.184e+15  3.518e+07   33646631   <2e-16 ***
## die                   2.589e+14  3.535e+07    7322275   <2e-16 ***
## differ                1.321e+15  3.527e+07   37445641   <2e-16 ***
## disappoint            3.459e+15  3.989e+07   86706178   <2e-16 ***
## discontinu            4.374e+15  1.268e+08   34484248   <2e-16 ***
## divulg                7.415e+15  2.769e+08   26776544   <2e-16 ***
## doesnt               -7.889e+14  3.710e+07  -21267737   <2e-16 ***
## done                 -1.117e+14  3.079e+07   -3626397   <2e-16 ***
## dont                 -2.275e+14  1.925e+07  -11817691   <2e-16 ***
## download              2.025e+15  4.755e+07   42585707   <2e-16 ***
## drop                 -5.285e+14  3.445e+07  -15342473   <2e-16 ***
## email                -1.334e+15  4.057e+07  -32882439   <2e-16 ***
## emiss                 1.402e+15  1.884e+08    7443382   <2e-16 ***
## emoji                 2.721e+14  3.459e+07    7864115   <2e-16 ***
## even                  2.197e+15  3.862e+07   56894488   <2e-16 ***
## event                 5.191e+14  3.959e+07   13111789   <2e-16 ***
## ever                  1.262e+15  4.475e+07   28199724   <2e-16 ***
## everi                 9.745e+14  2.653e+07   36737236   <2e-16 ***
## everyth               9.316e+14  6.367e+07   14632245   <2e-16 ***
## facebook              2.034e+15  9.430e+07   21570316   <2e-16 ***
## fail                 -2.029e+15  3.510e+07  -57801020   <2e-16 ***
## featur               -1.225e+14  3.656e+07   -3349530   <2e-16 ***
## feel                  1.000e+15  3.921e+07   25504940   <2e-16 ***
## femal                        NA         NA         NA       NA    
## figur                -2.740e+15  4.623e+07  -59260032   <2e-16 ***
## final                -2.953e+15  5.950e+07  -49634975   <2e-16 ***
## finger               -2.896e+15  3.418e+07  -84720574   <2e-16 ***
## fingerprint           7.066e+14  2.298e+07   30744148   <2e-16 ***
## fire                  3.260e+15  4.933e+07   66074821   <2e-16 ***
## first                 2.238e+15  4.033e+07   55482242   <2e-16 ***
## fix                  -1.043e+13  3.280e+07    -317979   <2e-16 ***
## follow               -3.606e+14  2.524e+07  -14287286   <2e-16 ***
## freak                 1.865e+15  1.269e+07  146995843   <2e-16 ***
## free                 -2.040e+14  3.469e+07   -5880535   <2e-16 ***
## fun                  -1.442e+15  3.393e+07  -42494969   <2e-16 ***
## generat               2.203e+15  4.350e+07   50649733   <2e-16 ***
## genius               -2.805e+15  5.186e+07  -54084196   <2e-16 ***
## get                  -1.915e+14  1.286e+07  -14892955   <2e-16 ***
## give                 -3.420e+14  2.563e+07  -13340846   <2e-16 ***
## gold                 -1.719e+15  2.879e+07  -59692323   <2e-16 ***
## gonna                -1.788e+15  3.398e+07  -52617649   <2e-16 ***
## good                  1.026e+15  3.393e+07   30228875   <2e-16 ***
## googl                 1.697e+14  2.303e+07    7369213   <2e-16 ***
## got                   1.131e+14  3.399e+07    3327497   <2e-16 ***
## great                -1.874e+15  4.510e+07  -41544656   <2e-16 ***
## guess                 4.047e+14  4.829e+07    8381102   <2e-16 ***
## guy                  -1.489e+15  3.129e+07  -47585590   <2e-16 ***
## happen               -1.477e+15  4.406e+07  -33520895   <2e-16 ***
## happi                -1.022e+15  4.359e+07  -23449803   <2e-16 ***
## hate                  2.889e+15  2.513e+07  114941958   <2e-16 ***
## help                 -2.379e+15  2.513e+07  -94663824   <2e-16 ***
## hey                  -3.295e+14  2.554e+07  -12900779   <2e-16 ***
## hope                  1.719e+15  3.280e+07   52395960   <2e-16 ***
## hour                 -1.571e+15  3.524e+07  -44561828   <2e-16 ***
## httpbitly18xc8dk     -8.622e+14  9.701e+07   -8887484   <2e-16 ***
## ibrooklynb           -5.675e+14  3.801e+07  -14931892   <2e-16 ***
## idea                  1.124e+15  4.436e+07   25345397   <2e-16 ***
## ill                   2.106e+15  5.577e+07   37756437   <2e-16 ***
## imessag               1.849e+15  3.977e+07   46502196   <2e-16 ***
## impress              -5.331e+14  3.082e+07  -17295165   <2e-16 ***
## improv                2.019e+15  3.456e+07   58426788   <2e-16 ***
## innov                -2.541e+14  2.963e+07   -8576540   <2e-16 ***
## instead              -5.642e+14  4.377e+07  -12889118   <2e-16 ***
## internet              2.183e+15  4.725e+07   46200453   <2e-16 ***
## ios7                  7.545e+14  2.211e+07   34120577   <2e-16 ***
## ipad                  3.161e+14  1.632e+07   19368679   <2e-16 ***
## iphon                 4.065e+14  7.159e+06   56790849   <2e-16 ***
## iphone4               3.403e+14  4.920e+07    6916192   <2e-16 ***
## iphone5              -4.880e+14  1.469e+07  -33220704   <2e-16 ***
## iphone5c             -1.075e+15  1.722e+07  -62421271   <2e-16 ***
## iphoto               -1.909e+15  8.144e+07  -23434643   <2e-16 ***
## ipod                 -5.428e+14  4.168e+07  -13022484   <2e-16 ***
## ipodplayerpromo       2.216e+15  5.638e+07   39310168   <2e-16 ***
## isnt                 -1.732e+15  3.342e+07  -51838596   <2e-16 ***
## itun                  1.004e+15  2.475e+07   40578840   <2e-16 ***
## ive                  -2.412e+15  4.976e+07  -48476344   <2e-16 ***
## job                   2.604e+15  5.409e+07   48139015   <2e-16 ***
## just                  2.243e+14  1.485e+07   15109878   <2e-16 ***
## keynot               -1.352e+15  3.632e+07  -37209467   <2e-16 ***
## know                 -3.530e+14  2.869e+07  -12304243   <2e-16 ***
## last                 -1.140e+15  4.573e+07  -24921705   <2e-16 ***
## launch               -1.950e+15  4.400e+07  -44315729   <2e-16 ***
## let                  -1.297e+15  3.636e+07  -35664023   <2e-16 ***
## life                  1.027e+15  3.133e+07   32779740   <2e-16 ***
## like                  3.430e+14  1.380e+07   24850189   <2e-16 ***
## line                  1.853e+15  5.901e+07   31393396   <2e-16 ***
## lmao                 -1.490e+15  4.065e+07  -36654917   <2e-16 ***
## lock                 -2.961e+15  4.315e+07  -68620829   <2e-16 ***
## lol                  -1.465e+14  2.226e+07   -6577814   <2e-16 ***
## look                 -3.959e+14  1.927e+07  -20546515   <2e-16 ***
## los                  -5.600e+14  6.879e+07   -8141108   <2e-16 ***
## lost                  8.202e+14  4.914e+07   16691475   <2e-16 ***
## love                 -1.509e+15  2.229e+07  -67686561   <2e-16 ***
## mac                  -3.286e+14  2.872e+07  -11440789   <2e-16 ***
## macbook              -8.167e+14  3.650e+07  -22372762   <2e-16 ***
## made                 -1.594e+15  3.428e+07  -46501317   <2e-16 ***
## make                 -5.678e+14  1.544e+07  -36771301   <2e-16 ***
## man                  -3.677e+14  3.996e+07   -9201168   <2e-16 ***
## mani                  6.847e+14  4.655e+07   14708969   <2e-16 ***
## market                1.958e+15  2.693e+07   72698006   <2e-16 ***
## mayb                 -5.275e+14  3.961e+07  -13317002   <2e-16 ***
## mean                 -1.506e+15  4.017e+07  -37483828   <2e-16 ***
## microsoft            -2.728e+14  2.108e+07  -12944707   <2e-16 ***
## mishiza              -2.840e+15  3.955e+07  -71799020   <2e-16 ***
## miss                  2.176e+15  4.084e+07   53295830   <2e-16 ***
## mobil                -1.796e+15  2.734e+07  -65696457   <2e-16 ***
## money                -1.217e+15  6.131e+07  -19843102   <2e-16 ***
## motorola              3.469e+14  4.287e+07    8093391   <2e-16 ***
## move                 -1.632e+15  5.308e+07  -30742354   <2e-16 ***
## much                  7.026e+14  3.177e+07   22116210   <2e-16 ***
## music                -2.967e+15  6.402e+07  -46337541   <2e-16 ***
## natz0711                     NA         NA         NA       NA    
## need                 -5.313e+14  1.821e+07  -29181866   <2e-16 ***
## never                -1.550e+15  3.783e+07  -40967153   <2e-16 ***
## new                  -1.018e+14  1.066e+07   -9546777   <2e-16 ***
## news                 -8.980e+14  3.033e+07  -29607919   <2e-16 ***
## next.                 5.650e+13  2.615e+07    2160383   <2e-16 ***
## nfc                  -2.881e+15  4.427e+07  -65093797   <2e-16 ***
## nokia                -2.656e+14  2.443e+07  -10868917   <2e-16 ***
## noth                 -5.459e+14  4.374e+07  -12480163   <2e-16 ***
## now                  -2.744e+14  1.603e+07  -17117798   <2e-16 ***
## nsa                   1.423e+13  3.148e+07     451904   <2e-16 ***
## nuevo                -1.817e+15  6.661e+07  -27282254   <2e-16 ***
## offer                -2.543e+15  3.757e+07  -67681569   <2e-16 ***
## old                   6.831e+14  3.002e+07   22751976   <2e-16 ***
## one                  -9.342e+14  1.629e+07  -57331111   <2e-16 ***
## page                 -2.348e+15  4.623e+07  -50779375   <2e-16 ***
## para                  6.474e+12  3.057e+07     211809   <2e-16 ***
## peopl                -2.220e+14  2.695e+07   -8237006   <2e-16 ***
## perfect              -1.465e+15  6.262e+07  -23386777   <2e-16 ***
## person                2.251e+15  4.550e+07   49461017   <2e-16 ***
## phone                -3.136e+14  1.304e+07  -24040686   <2e-16 ***
## photog                       NA         NA         NA       NA    
## photographi                  NA         NA         NA       NA    
## pictur                1.466e+15  3.441e+07   42617651   <2e-16 ***
## plastic               1.057e+15  3.327e+07   31786046   <2e-16 ***
## play                 -5.436e+14  4.390e+07  -12384144   <2e-16 ***
## pleas                -6.783e+14  2.295e+07  -29552290   <2e-16 ***
## ppl                  -2.814e+14  2.959e+07   -9510826   <2e-16 ***
## preorder             -1.955e+15  2.873e+07  -68045021   <2e-16 ***
## price                -1.269e+15  2.412e+07  -52641453   <2e-16 ***
## print                -3.065e+15  4.089e+07  -74959837   <2e-16 ***
## pro                  -7.326e+14  6.541e+07  -11199687   <2e-16 ***
## problem               2.509e+15  3.855e+07   65097485   <2e-16 ***
## product               1.313e+14  3.426e+07    3833607   <2e-16 ***
## promo                 6.406e+14  7.623e+06   84031965   <2e-16 ***
## promoipodplayerpromo -6.586e+15  5.194e+07 -126814762   <2e-16 ***
## put                  -1.571e+15  3.625e+07  -43331484   <2e-16 ***
## que                  -1.607e+15  2.548e+07  -63043216   <2e-16 ***
## quiet                        NA         NA         NA       NA    
## read                 -2.187e+15  8.949e+07  -24438743   <2e-16 ***
## realli               -1.593e+15  1.961e+07  -81220700   <2e-16 ***
## recommend                    NA         NA         NA       NA    
## refus                -9.655e+15  2.319e+08  -41640481   <2e-16 ***
## releas               -1.462e+15  2.651e+07  -55141122   <2e-16 ***
## right                -1.552e+15  3.356e+07  -46237792   <2e-16 ***
## said                  1.110e+14  4.600e+07    2412887   <2e-16 ***
## samsung              -1.039e+15  2.135e+07  -48669779   <2e-16 ***
## samsungsa                    NA         NA         NA       NA    
## say                  -1.601e+15  2.609e+07  -61362375   <2e-16 ***
## scanner              -4.001e+14  3.764e+07  -10627871   <2e-16 ***
## screen                2.493e+15  3.255e+07   76573764   <2e-16 ***
## secur                -1.067e+14  4.003e+07   -2666218   <2e-16 ***
## see                  -1.291e+15  3.483e+07  -37073237   <2e-16 ***
## seem                  3.884e+14  3.886e+07    9993324   <2e-16 ***
## sell                 -3.030e+14  3.400e+07   -8910856   <2e-16 ***
## send                 -3.829e+15  4.526e+07  -84595793   <2e-16 ***
## servic               -2.238e+15  3.904e+07  -57317962   <2e-16 ***
## shame                 3.629e+15  8.598e+07   42204821   <2e-16 ***
## share                 4.491e+13  3.103e+07    1447190   <2e-16 ***
## short                 1.306e+15  4.595e+07   28423037   <2e-16 ***
## show                 -2.610e+15  4.690e+07  -55655876   <2e-16 ***
## simpl                -1.745e+15  4.840e+07  -36066415   <2e-16 ***
## sinc                 -6.763e+14  3.599e+07  -18791480   <2e-16 ***
## siri                  9.095e+14  2.898e+07   31378661   <2e-16 ***
## smart                -3.270e+15  5.420e+07  -60333073   <2e-16 ***
## smartphon             1.776e+15  4.402e+07   40347795   <2e-16 ***
## someth               -2.423e+15  4.485e+07  -54020029   <2e-16 ***
## soon                 -1.071e+15  5.329e+07  -20090147   <2e-16 ***
## stand                 2.611e+14  4.622e+07    5649417   <2e-16 ***
## start                 2.011e+15  3.619e+07   55555238   <2e-16 ***
## steve                -7.724e+14  3.778e+07  -20444006   <2e-16 ***
## still                 5.952e+14  2.562e+07   23233662   <2e-16 ***
## stop                 -1.169e+15  2.912e+07  -40151745   <2e-16 ***
## store                 2.691e+14  1.566e+07   17178899   <2e-16 ***
## stuff                 1.053e+15  3.041e+07   34630280   <2e-16 ***
## stupid                2.242e+15  3.776e+07   59378393   <2e-16 ***
## suck                  3.312e+15  5.902e+07   56117215   <2e-16 ***
## support              -3.927e+14  2.526e+07  -15545957   <2e-16 ***
## sure                  5.276e+14  2.444e+07   21591174   <2e-16 ***
## switch                1.026e+15  3.713e+07   27636229   <2e-16 ***
## take                  1.454e+15  3.049e+07   47676108   <2e-16 ***
## talk                  1.160e+15  4.092e+07   28344673   <2e-16 ***
## team                  5.116e+14  4.279e+07   11956277   <2e-16 ***
## tech                  1.372e+14  2.826e+07    4855520   <2e-16 ***
## technolog             1.859e+15  4.789e+07   38828394   <2e-16 ***
## tell                 -4.407e+15  3.032e+07 -145319857   <2e-16 ***
## text                 -1.056e+15  2.855e+07  -37005811   <2e-16 ***
## thank                 5.020e+14  1.624e+07   30920933   <2e-16 ***
## that                  5.877e+14  2.743e+07   21425536   <2e-16 ***
## theyr                 4.440e+14  4.140e+07   10725565   <2e-16 ***
## thing                 1.314e+15  2.813e+07   46703319   <2e-16 ***
## think                 5.194e+12  1.731e+07     300098   <2e-16 ***
## tho                  -1.177e+15  4.034e+07  -29171998   <2e-16 ***
## thought               1.410e+15  3.416e+07   41265489   <2e-16 ***
## time                 -1.168e+15  1.910e+07  -61131587   <2e-16 ***
## today                -2.078e+15  3.284e+07  -63274599   <2e-16 ***
## togeth                7.002e+13  4.243e+07    1650277   <2e-16 ***
## touch                -1.855e+15  3.590e+07  -51661762   <2e-16 ***
## touchid              -7.105e+14  3.271e+07  -21718393   <2e-16 ***
## tri                   3.939e+14  2.578e+07   15281257   <2e-16 ***
## true                 -1.195e+15  4.744e+07  -25195949   <2e-16 ***
## turn                 -6.904e+14  3.881e+07  -17786011   <2e-16 ***
## twitter              -1.036e+15  2.130e+07  -48635138   <2e-16 ***
## two                  -3.955e+14  3.708e+07  -10663907   <2e-16 ***
## updat                -8.976e+14  2.349e+07  -38215084   <2e-16 ***
## upgrad                7.140e+14  3.412e+07   20923641   <2e-16 ***
## use                  -6.561e+14  1.998e+07  -32838130   <2e-16 ***
## user                 -8.988e+14  4.875e+07  -18436597   <2e-16 ***
## via                  -1.984e+14  2.744e+07   -7231557   <2e-16 ***
## video                 5.101e+14  3.166e+07   16109357   <2e-16 ***
## wait                 -5.983e+14  2.648e+07  -22596252   <2e-16 ***
## want                  1.260e+15  2.132e+07   59111862   <2e-16 ***
## watch                -2.470e+15  4.225e+07  -58471474   <2e-16 ***
## way                   7.160e+14  2.652e+07   26997060   <2e-16 ***
## week                 -3.224e+14  3.010e+07  -10710831   <2e-16 ***
## well                 -1.755e+14  2.428e+07   -7228548   <2e-16 ***
## what                 -4.408e+13  3.505e+07   -1257553   <2e-16 ***
## white                -1.392e+15  3.818e+07  -36452539   <2e-16 ***
## will                 -9.474e+14  1.517e+07  -62457098   <2e-16 ***
## windowsphon          -3.886e+14  3.529e+07  -11011247   <2e-16 ***
## wish                 -7.643e+14  3.563e+07  -21453368   <2e-16 ***
## without               9.632e+15  1.229e+08   78375458   <2e-16 ***
## wonder               -3.356e+15  4.465e+07  -75160731   <2e-16 ***
## wont                  7.487e+14  2.439e+07   30700123   <2e-16 ***
## work                 -5.250e+14  2.149e+07  -24427395   <2e-16 ***
## world                -9.070e+14  3.288e+07  -27588120   <2e-16 ***
## worst                -6.848e+14  4.205e+07  -16286313   <2e-16 ***
## wow                  -2.659e+15  4.167e+07  -63801460   <2e-16 ***
## wtf                   3.310e+15  3.388e+07   97702039   <2e-16 ***
## yall                 -1.190e+15  2.750e+07  -43266794   <2e-16 ***
## year                  2.548e+15  3.637e+07   70059925   <2e-16 ***
## yes                   1.047e+15  4.189e+07   24989769   <2e-16 ***
## yet                  -8.556e+14  3.875e+07  -22082028   <2e-16 ***
## yooo                  2.500e+15  4.613e+07   54187843   <2e-16 ***
## your                  1.399e+15  2.973e+07   47054552   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance:  708.98  on 825  degrees of freedom
## Residual deviance: 2955.58  on 523  degrees of freedom
## AIC: 3561.6
## 
## Number of Fisher Scoring iterations: 25

prediction

predictlogit <- predict(tweetlogit, newdata = testSparse, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
# type = "response" returns probabilities, so threshold them at 0.5 to get classes
confusionmatrix <- table(testSparse$negative, predictlogit > 0.5)
confusionmatrix
##        
##         FALSE TRUE
##   FALSE   253   47
##   TRUE     27   28
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(testSparse))
## 
## Accuracy 0.7915493

The accuracy (0.79) is worse than the baseline (0.85). This is an example of overfitting: the model fits the training set really well, but does not perform well on the test set.
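For a side-by-side view, the test-set accuracies can be recomputed from the confusion matrices printed above. The counts are copied by hand, so this is a summary sketch and not a re-run of the models.

```r
n_test <- 355   # tweets in the test set

# Correct predictions per model, copied from the confusion matrices above.
correct <- c(baseline     = 300,
             CART         = 294 + 18,
             randomForest = 293 + 21,
             logistic     = 253 + 28)

accuracy <- correct / n_test
sort(round(accuracy, 3), decreasing = TRUE)
# randomForest 0.885, CART 0.879, baseline 0.845, logistic 0.792
```

Only the logistic regression model falls below the baseline; the random forest is slightly ahead of CART.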

A logistic regression model with a large number of variables is particularly at risk of overfitting.

The warning messages from the ‘glm’ function have to do with the number of variables, and with the fact that the model is overfitting the training set.
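One common way to trigger these warnings is a predictor that separates the two classes perfectly, which becomes more likely as the number of variables grows. The toy data below is made up to show the effect; it is an assumption that something similar happens with some of the terms in the tweet model.

```r
# Made-up data: x > 3.5 predicts y perfectly (complete separation).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

# Fit the model and collect the warning messages instead of printing them.
glm_warnings <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = "binomial"),
  warning = function(w) {
    glm_warnings <<- c(glm_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

glm_warnings   # includes "fitted probabilities numerically 0 or 1 occurred"
coef(fit)      # very large coefficients, as in the summary above
```

With perfect separation the likelihood keeps improving as the coefficients grow, so the estimates run off towards very large values, much like the coefficients around 1e+15 in the summary above.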