Example from MITx, The Analytics Edge: tweets about Apple collected from Twitter, each rated for sentiment.
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
# Label a tweet as negative if its average sentiment score is -1 or lower
tweets$negative <- as.factor(tweets$Avg <= -1)
table(tweets$negative)
##
## FALSE TRUE
## 999 182
A corpus is a collection of documents. We’ll need to convert our tweets to a corpus for pre-processing.
library(tm)          # text mining: VCorpus, tm_map, DocumentTermMatrix, ...
library(SnowballC)   # stemming back-end used by stemDocument()
corpus <- VCorpus(VectorSource(tweets$Tweet))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1181
corpus[[1]]$content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"
Stemming or removing stop words can be done with the tm_map function.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus[[1]]$content
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
corpus <- tm_map(corpus, removePunctuation)
corpus[[1]]$content
## [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"
stopwords("english")[1:10]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
Remove the word ‘apple’ (present in virtually all of these tweets about Apple, so it carries no information for prediction) and all English stop words.
corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]$content
## [1] " say far best customer care service ever received appstore"
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]$content
## [1] "say far best custom care servic ever receiv appstor"
The tm package provides a function called DocumentTermMatrix that generates a matrix where the rows correspond to documents, in our case tweets, and the columns correspond to words in those tweets.
frequencies <- DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
inspect(frequencies[1000:1005, 505:515])   # documents 1000 to 1005, terms 505 to 515
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs cheapen cheaper check cheep cheer cheerio cherylcol chief chiiiiqu
## 1000 0 0 0 0 0 0 0 0 0
## 1001 0 0 0 0 0 0 0 0 0
## 1002 0 0 0 0 0 0 0 0 0
## 1003 0 0 0 0 0 0 0 0 0
## 1004 0 0 0 0 0 0 0 0 0
## 1005 0 0 0 0 1 0 0 0 0
## Terms
## Docs child
## 1000 0
## 1001 0
## 1002 0
## 1003 0
## 1004 0
## 1005 0
The term “cheer” appears once, in tweet 1005; none of the other displayed terms appear in any of these six tweets. This matrix is what we call sparse: the vast majority of its entries are zero.
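The exact sparsity can be computed from the counts reported above (8980 non-zero entries out of 1181 × 3289 cells); a quick sketch:
nnz <- sum(as.matrix(frequencies) > 0)   # 8980 non-zero entries
1 - nnz / prod(dim(frequencies))         # about 0.9977 -- the "100%" shown above is rounded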
What are the most popular (frequent) terms?
# lowfreq = 100: show only terms that appear at least 100 times in total
findFreqTerms(frequencies, lowfreq = 100)
## [1] "iphon" "itun" "new"
# lowfreq = 20: show only terms that appear at least 20 times in total
findFreqTerms(frequencies, lowfreq = 20)
## [1] "android" "anyon" "app"
## [4] "appl" "back" "batteri"
## [7] "better" "buy" "can"
## [10] "cant" "come" "dont"
## [13] "fingerprint" "freak" "get"
## [16] "googl" "ios7" "ipad"
## [19] "iphon" "iphone5" "iphone5c"
## [22] "ipod" "ipodplayerpromo" "itun"
## [25] "just" "like" "lol"
## [28] "look" "love" "make"
## [31] "market" "microsoft" "need"
## [34] "new" "now" "one"
## [37] "phone" "pleas" "promo"
## [40] "promoipodplayerpromo" "realli" "releas"
## [43] "samsung" "say" "store"
## [46] "thank" "think" "time"
## [49] "twitter" "updat" "use"
## [52] "via" "want" "well"
## [55] "will" "work"
Fifty-six different terms appear at least twenty times, out of the 3289 terms in the matrix; many of the remaining terms are probably useless for our prediction model. More terms mean more independent variables, which makes the models slower to build, and with all 3289 terms we would have almost three independent variables per observation (3289 terms for 1181 tweets). A high ratio of independent variables to observations hurts how well a model generalizes, so we remove the sparsest terms.
# 0.995 is the sparsity threshold: keep only terms that appear in at least 0.5% of the tweets
# (a threshold of 0.98 would keep only terms that appear in at least 2% of the tweets)
sparse <- removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
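As a sanity check on the threshold (a sketch, output not shown), the count of terms appearing in more than 0.5% of the tweets should match the 309 terms kept above:
docfreq <- colSums(as.matrix(frequencies) > 0)   # number of tweets each term appears in
sum(docfreq > 0.005 * nrow(frequencies))         # should equal ncol(sparse), i.e. 309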
tweetsSparse <- as.data.frame(as.matrix(sparse))
# Some terms start with a digit, but R variable names cannot start with a digit,
# so make.names() converts the column names into syntactically valid ones.
colnames(tweetsSparse) <- make.names(colnames(tweetsSparse))
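For example, a couple of the Twitter handles in the matrix start with a digit; make.names() prefixes them with “X”, which is why they appear as X7evenstarz and X244tsuyoponzu in the regression summary further down:
make.names(c("7evenstarz", "244tsuyoponzu"))   # becomes "X7evenstarz" "X244tsuyoponzu"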
tweetsSparse$negative <- tweets$negative
library(caTools)   # provides sample.split()
set.seed(123)
split <- sample.split(tweetsSparse$negative, SplitRatio = 0.7)
trainSparse <- subset(tweetsSparse, split == TRUE)
testSparse <- subset(tweetsSparse, split == FALSE)
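sample.split stratifies on the outcome, so the training and test sets should contain roughly the same proportion of negative tweets; a quick check (sketch, output not shown):
prop.table(table(trainSparse$negative))
prop.table(table(testSparse$negative))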
library(rpart)        # CART
library(rpart.plot)   # provides prp() for plotting trees
tweetCART <- rpart(negative ~ ., data = trainSparse, method = "class")
prp(tweetCART)
predictCART <- predict(tweetCART, newdata = testSparse, type = "class")
confusionmatrix <- table(testSparse$negative, predictCART)
confusionmatrix
## predictCART
## FALSE TRUE
## FALSE 294 6
## TRUE 37 18
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(testSparse))
##
## Accuracy 0.8788732
Compare this to the baseline model, which predicts the most frequent class, ‘non-negative’ sentiment, for every tweet. The test set contains 300 non-negative and 55 negative tweets, so the baseline accuracy is:
negativematrix <- table(testSparse$negative)
cat("\nAccuracy", max(negativematrix) / sum(negativematrix))
##
## Accuracy 0.8450704
A random forest model takes much longer to build, due to the large number of independent variables.
library(randomForest)
set.seed(123)
tweetRF <- randomForest(negative ~ ., data = trainSparse)
predictRF <- predict(tweetRF, newdata = testSparse)
confusionmatrix <- table(testSparse$negative, predictRF)
confusionmatrix
## predictRF
## FALSE TRUE
## FALSE 293 7
## TRUE 34 21
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(testSparse))
##
## Accuracy 0.884507
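The difference in build time can be measured directly with system.time (a sketch; exact timings depend on the machine):
system.time(rpart(negative ~ ., data = trainSparse, method = "class"))   # CART: near-instant
system.time(randomForest(negative ~ ., data = trainSparse))              # random forest: noticeably slower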
Finally, fit a logistic regression model with the same independent variables.
tweetlogit <- glm(negative ~ ., data = trainSparse, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(tweetlogit)
##
## Call:
## glm(formula = negative ~ ., family = "binomial", data = trainSparse)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 8.49
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.047e+15 5.330e+06 -196471463 <2e-16 ***
## X244tsuyoponzu 1.047e+15 3.048e+07 34353213 <2e-16 ***
## X7evenstarz 2.690e+15 4.542e+07 59227701 <2e-16 ***
## actual -7.674e+14 2.884e+07 -26611097 <2e-16 ***
## add -8.849e+14 6.062e+07 -14596618 <2e-16 ***
## alreadi -2.947e+14 3.772e+07 -7812707 <2e-16 ***
## alway -7.146e+14 4.813e+07 -14847398 <2e-16 ***
## amaz -6.693e+14 3.977e+07 -16829776 <2e-16 ***
## amazon -2.370e+15 9.950e+07 -23822526 <2e-16 ***
## android 7.892e+14 2.007e+07 39317161 <2e-16 ***
## announc -2.446e+15 2.809e+07 -87079701 <2e-16 ***
## anyon -4.159e+14 2.607e+07 -15950953 <2e-16 ***
## app 3.770e+14 1.556e+07 24219551 <2e-16 ***
## appl -5.837e+14 1.695e+07 -34427793 <2e-16 ***
## appstor -1.635e+15 3.697e+07 -44230461 <2e-16 ***
## arent -1.262e+15 4.287e+07 -29425117 <2e-16 ***
## ask -1.322e+15 3.861e+07 -34243520 <2e-16 ***
## avail -2.055e+15 3.782e+07 -54346409 <2e-16 ***
## away -5.002e+13 4.471e+07 -1118735 <2e-16 ***
## awesom 8.366e+14 3.464e+07 24154252 <2e-16 ***
## back -1.905e+15 2.304e+07 -82683091 <2e-16 ***
## batteri -1.136e+15 3.029e+07 -37503023 <2e-16 ***
## best 1.069e+15 3.759e+07 28445859 <2e-16 ***
## better 3.124e+14 2.169e+07 14401591 <2e-16 ***
## big -2.229e+15 3.069e+07 -72636420 <2e-16 ***
## bit 8.489e+13 2.799e+07 3033255 <2e-16 ***
## black 1.121e+15 3.684e+07 30419212 <2e-16 ***
## blackberri -2.069e+15 3.251e+07 -63650797 <2e-16 ***
## break. 1.791e+15 3.619e+07 49483853 <2e-16 ***
## bring -1.517e+15 3.576e+07 -42430018 <2e-16 ***
## burberri 4.290e+14 3.593e+07 11938729 <2e-16 ***
## busi -3.320e+15 3.130e+07 -106072554 <2e-16 ***
## buy 2.066e+15 2.469e+07 83699753 <2e-16 ***
## call 1.712e+15 3.203e+07 53454609 <2e-16 ***
## can -6.827e+14 1.522e+07 -44842110 <2e-16 ***
## cant 1.170e+15 2.174e+07 53816650 <2e-16 ***
## carbon -1.773e+15 1.194e+08 -14843110 <2e-16 ***
## card 4.728e+15 5.740e+07 82355362 <2e-16 ***
## care 3.600e+15 3.773e+07 95400584 <2e-16 ***
## case 1.364e+14 3.509e+07 3885327 <2e-16 ***
## cdp 1.503e+15 8.722e+07 17233989 <2e-16 ***
## chang -4.918e+13 3.216e+07 -1529048 <2e-16 ***
## charg 1.025e+15 4.422e+07 23173007 <2e-16 ***
## charger -1.167e+14 2.806e+07 -4158504 <2e-16 ***
## cheap 6.726e+13 3.328e+07 2020913 <2e-16 ***
## china 8.542e+14 2.605e+07 32790275 <2e-16 ***
## color -1.923e+15 2.638e+07 -72903150 <2e-16 ***
## colour 2.047e+15 5.176e+07 39555035 <2e-16 ***
## come -3.326e+14 1.973e+07 -16855683 <2e-16 ***
## compani -5.980e+14 2.785e+07 -21474343 <2e-16 ***
## condescens 3.956e+14 6.016e+07 6575506 <2e-16 ***
## condom 2.297e+15 4.130e+07 55607494 <2e-16 ***
## copi 6.395e+14 4.424e+07 14454035 <2e-16 ***
## crack -5.958e+14 4.425e+07 -13464583 <2e-16 ***
## creat -3.711e+14 3.639e+07 -10199949 <2e-16 ***
## custom 3.686e+14 4.341e+07 8492670 <2e-16 ***
## darn 7.905e+14 3.199e+07 24715387 <2e-16 ***
## data 1.420e+15 3.652e+07 38892983 <2e-16 ***
## date -4.303e+14 4.424e+07 -9726799 <2e-16 ***
## day 1.896e+15 2.777e+07 68276510 <2e-16 ***
## dear 9.422e+14 3.169e+07 29734282 <2e-16 ***
## design -4.603e+14 3.663e+07 -12565908 <2e-16 ***
## develop -9.558e+14 2.670e+07 -35795214 <2e-16 ***
## devic -6.933e+14 2.559e+07 -27089897 <2e-16 ***
## didnt 1.184e+15 3.518e+07 33646631 <2e-16 ***
## die 2.589e+14 3.535e+07 7322275 <2e-16 ***
## differ 1.321e+15 3.527e+07 37445641 <2e-16 ***
## disappoint 3.459e+15 3.989e+07 86706178 <2e-16 ***
## discontinu 4.374e+15 1.268e+08 34484248 <2e-16 ***
## divulg 7.415e+15 2.769e+08 26776544 <2e-16 ***
## doesnt -7.889e+14 3.710e+07 -21267737 <2e-16 ***
## done -1.117e+14 3.079e+07 -3626397 <2e-16 ***
## dont -2.275e+14 1.925e+07 -11817691 <2e-16 ***
## download 2.025e+15 4.755e+07 42585707 <2e-16 ***
## drop -5.285e+14 3.445e+07 -15342473 <2e-16 ***
## email -1.334e+15 4.057e+07 -32882439 <2e-16 ***
## emiss 1.402e+15 1.884e+08 7443382 <2e-16 ***
## emoji 2.721e+14 3.459e+07 7864115 <2e-16 ***
## even 2.197e+15 3.862e+07 56894488 <2e-16 ***
## event 5.191e+14 3.959e+07 13111789 <2e-16 ***
## ever 1.262e+15 4.475e+07 28199724 <2e-16 ***
## everi 9.745e+14 2.653e+07 36737236 <2e-16 ***
## everyth 9.316e+14 6.367e+07 14632245 <2e-16 ***
## facebook 2.034e+15 9.430e+07 21570316 <2e-16 ***
## fail -2.029e+15 3.510e+07 -57801020 <2e-16 ***
## featur -1.225e+14 3.656e+07 -3349530 <2e-16 ***
## feel 1.000e+15 3.921e+07 25504940 <2e-16 ***
## femal NA NA NA NA
## figur -2.740e+15 4.623e+07 -59260032 <2e-16 ***
## final -2.953e+15 5.950e+07 -49634975 <2e-16 ***
## finger -2.896e+15 3.418e+07 -84720574 <2e-16 ***
## fingerprint 7.066e+14 2.298e+07 30744148 <2e-16 ***
## fire 3.260e+15 4.933e+07 66074821 <2e-16 ***
## first 2.238e+15 4.033e+07 55482242 <2e-16 ***
## fix -1.043e+13 3.280e+07 -317979 <2e-16 ***
## follow -3.606e+14 2.524e+07 -14287286 <2e-16 ***
## freak 1.865e+15 1.269e+07 146995843 <2e-16 ***
## free -2.040e+14 3.469e+07 -5880535 <2e-16 ***
## fun -1.442e+15 3.393e+07 -42494969 <2e-16 ***
## generat 2.203e+15 4.350e+07 50649733 <2e-16 ***
## genius -2.805e+15 5.186e+07 -54084196 <2e-16 ***
## get -1.915e+14 1.286e+07 -14892955 <2e-16 ***
## give -3.420e+14 2.563e+07 -13340846 <2e-16 ***
## gold -1.719e+15 2.879e+07 -59692323 <2e-16 ***
## gonna -1.788e+15 3.398e+07 -52617649 <2e-16 ***
## good 1.026e+15 3.393e+07 30228875 <2e-16 ***
## googl 1.697e+14 2.303e+07 7369213 <2e-16 ***
## got 1.131e+14 3.399e+07 3327497 <2e-16 ***
## great -1.874e+15 4.510e+07 -41544656 <2e-16 ***
## guess 4.047e+14 4.829e+07 8381102 <2e-16 ***
## guy -1.489e+15 3.129e+07 -47585590 <2e-16 ***
## happen -1.477e+15 4.406e+07 -33520895 <2e-16 ***
## happi -1.022e+15 4.359e+07 -23449803 <2e-16 ***
## hate 2.889e+15 2.513e+07 114941958 <2e-16 ***
## help -2.379e+15 2.513e+07 -94663824 <2e-16 ***
## hey -3.295e+14 2.554e+07 -12900779 <2e-16 ***
## hope 1.719e+15 3.280e+07 52395960 <2e-16 ***
## hour -1.571e+15 3.524e+07 -44561828 <2e-16 ***
## httpbitly18xc8dk -8.622e+14 9.701e+07 -8887484 <2e-16 ***
## ibrooklynb -5.675e+14 3.801e+07 -14931892 <2e-16 ***
## idea 1.124e+15 4.436e+07 25345397 <2e-16 ***
## ill 2.106e+15 5.577e+07 37756437 <2e-16 ***
## imessag 1.849e+15 3.977e+07 46502196 <2e-16 ***
## impress -5.331e+14 3.082e+07 -17295165 <2e-16 ***
## improv 2.019e+15 3.456e+07 58426788 <2e-16 ***
## innov -2.541e+14 2.963e+07 -8576540 <2e-16 ***
## instead -5.642e+14 4.377e+07 -12889118 <2e-16 ***
## internet 2.183e+15 4.725e+07 46200453 <2e-16 ***
## ios7 7.545e+14 2.211e+07 34120577 <2e-16 ***
## ipad 3.161e+14 1.632e+07 19368679 <2e-16 ***
## iphon 4.065e+14 7.159e+06 56790849 <2e-16 ***
## iphone4 3.403e+14 4.920e+07 6916192 <2e-16 ***
## iphone5 -4.880e+14 1.469e+07 -33220704 <2e-16 ***
## iphone5c -1.075e+15 1.722e+07 -62421271 <2e-16 ***
## iphoto -1.909e+15 8.144e+07 -23434643 <2e-16 ***
## ipod -5.428e+14 4.168e+07 -13022484 <2e-16 ***
## ipodplayerpromo 2.216e+15 5.638e+07 39310168 <2e-16 ***
## isnt -1.732e+15 3.342e+07 -51838596 <2e-16 ***
## itun 1.004e+15 2.475e+07 40578840 <2e-16 ***
## ive -2.412e+15 4.976e+07 -48476344 <2e-16 ***
## job 2.604e+15 5.409e+07 48139015 <2e-16 ***
## just 2.243e+14 1.485e+07 15109878 <2e-16 ***
## keynot -1.352e+15 3.632e+07 -37209467 <2e-16 ***
## know -3.530e+14 2.869e+07 -12304243 <2e-16 ***
## last -1.140e+15 4.573e+07 -24921705 <2e-16 ***
## launch -1.950e+15 4.400e+07 -44315729 <2e-16 ***
## let -1.297e+15 3.636e+07 -35664023 <2e-16 ***
## life 1.027e+15 3.133e+07 32779740 <2e-16 ***
## like 3.430e+14 1.380e+07 24850189 <2e-16 ***
## line 1.853e+15 5.901e+07 31393396 <2e-16 ***
## lmao -1.490e+15 4.065e+07 -36654917 <2e-16 ***
## lock -2.961e+15 4.315e+07 -68620829 <2e-16 ***
## lol -1.465e+14 2.226e+07 -6577814 <2e-16 ***
## look -3.959e+14 1.927e+07 -20546515 <2e-16 ***
## los -5.600e+14 6.879e+07 -8141108 <2e-16 ***
## lost 8.202e+14 4.914e+07 16691475 <2e-16 ***
## love -1.509e+15 2.229e+07 -67686561 <2e-16 ***
## mac -3.286e+14 2.872e+07 -11440789 <2e-16 ***
## macbook -8.167e+14 3.650e+07 -22372762 <2e-16 ***
## made -1.594e+15 3.428e+07 -46501317 <2e-16 ***
## make -5.678e+14 1.544e+07 -36771301 <2e-16 ***
## man -3.677e+14 3.996e+07 -9201168 <2e-16 ***
## mani 6.847e+14 4.655e+07 14708969 <2e-16 ***
## market 1.958e+15 2.693e+07 72698006 <2e-16 ***
## mayb -5.275e+14 3.961e+07 -13317002 <2e-16 ***
## mean -1.506e+15 4.017e+07 -37483828 <2e-16 ***
## microsoft -2.728e+14 2.108e+07 -12944707 <2e-16 ***
## mishiza -2.840e+15 3.955e+07 -71799020 <2e-16 ***
## miss 2.176e+15 4.084e+07 53295830 <2e-16 ***
## mobil -1.796e+15 2.734e+07 -65696457 <2e-16 ***
## money -1.217e+15 6.131e+07 -19843102 <2e-16 ***
## motorola 3.469e+14 4.287e+07 8093391 <2e-16 ***
## move -1.632e+15 5.308e+07 -30742354 <2e-16 ***
## much 7.026e+14 3.177e+07 22116210 <2e-16 ***
## music -2.967e+15 6.402e+07 -46337541 <2e-16 ***
## natz0711 NA NA NA NA
## need -5.313e+14 1.821e+07 -29181866 <2e-16 ***
## never -1.550e+15 3.783e+07 -40967153 <2e-16 ***
## new -1.018e+14 1.066e+07 -9546777 <2e-16 ***
## news -8.980e+14 3.033e+07 -29607919 <2e-16 ***
## next. 5.650e+13 2.615e+07 2160383 <2e-16 ***
## nfc -2.881e+15 4.427e+07 -65093797 <2e-16 ***
## nokia -2.656e+14 2.443e+07 -10868917 <2e-16 ***
## noth -5.459e+14 4.374e+07 -12480163 <2e-16 ***
## now -2.744e+14 1.603e+07 -17117798 <2e-16 ***
## nsa 1.423e+13 3.148e+07 451904 <2e-16 ***
## nuevo -1.817e+15 6.661e+07 -27282254 <2e-16 ***
## offer -2.543e+15 3.757e+07 -67681569 <2e-16 ***
## old 6.831e+14 3.002e+07 22751976 <2e-16 ***
## one -9.342e+14 1.629e+07 -57331111 <2e-16 ***
## page -2.348e+15 4.623e+07 -50779375 <2e-16 ***
## para 6.474e+12 3.057e+07 211809 <2e-16 ***
## peopl -2.220e+14 2.695e+07 -8237006 <2e-16 ***
## perfect -1.465e+15 6.262e+07 -23386777 <2e-16 ***
## person 2.251e+15 4.550e+07 49461017 <2e-16 ***
## phone -3.136e+14 1.304e+07 -24040686 <2e-16 ***
## photog NA NA NA NA
## photographi NA NA NA NA
## pictur 1.466e+15 3.441e+07 42617651 <2e-16 ***
## plastic 1.057e+15 3.327e+07 31786046 <2e-16 ***
## play -5.436e+14 4.390e+07 -12384144 <2e-16 ***
## pleas -6.783e+14 2.295e+07 -29552290 <2e-16 ***
## ppl -2.814e+14 2.959e+07 -9510826 <2e-16 ***
## preorder -1.955e+15 2.873e+07 -68045021 <2e-16 ***
## price -1.269e+15 2.412e+07 -52641453 <2e-16 ***
## print -3.065e+15 4.089e+07 -74959837 <2e-16 ***
## pro -7.326e+14 6.541e+07 -11199687 <2e-16 ***
## problem 2.509e+15 3.855e+07 65097485 <2e-16 ***
## product 1.313e+14 3.426e+07 3833607 <2e-16 ***
## promo 6.406e+14 7.623e+06 84031965 <2e-16 ***
## promoipodplayerpromo -6.586e+15 5.194e+07 -126814762 <2e-16 ***
## put -1.571e+15 3.625e+07 -43331484 <2e-16 ***
## que -1.607e+15 2.548e+07 -63043216 <2e-16 ***
## quiet NA NA NA NA
## read -2.187e+15 8.949e+07 -24438743 <2e-16 ***
## realli -1.593e+15 1.961e+07 -81220700 <2e-16 ***
## recommend NA NA NA NA
## refus -9.655e+15 2.319e+08 -41640481 <2e-16 ***
## releas -1.462e+15 2.651e+07 -55141122 <2e-16 ***
## right -1.552e+15 3.356e+07 -46237792 <2e-16 ***
## said 1.110e+14 4.600e+07 2412887 <2e-16 ***
## samsung -1.039e+15 2.135e+07 -48669779 <2e-16 ***
## samsungsa NA NA NA NA
## say -1.601e+15 2.609e+07 -61362375 <2e-16 ***
## scanner -4.001e+14 3.764e+07 -10627871 <2e-16 ***
## screen 2.493e+15 3.255e+07 76573764 <2e-16 ***
## secur -1.067e+14 4.003e+07 -2666218 <2e-16 ***
## see -1.291e+15 3.483e+07 -37073237 <2e-16 ***
## seem 3.884e+14 3.886e+07 9993324 <2e-16 ***
## sell -3.030e+14 3.400e+07 -8910856 <2e-16 ***
## send -3.829e+15 4.526e+07 -84595793 <2e-16 ***
## servic -2.238e+15 3.904e+07 -57317962 <2e-16 ***
## shame 3.629e+15 8.598e+07 42204821 <2e-16 ***
## share 4.491e+13 3.103e+07 1447190 <2e-16 ***
## short 1.306e+15 4.595e+07 28423037 <2e-16 ***
## show -2.610e+15 4.690e+07 -55655876 <2e-16 ***
## simpl -1.745e+15 4.840e+07 -36066415 <2e-16 ***
## sinc -6.763e+14 3.599e+07 -18791480 <2e-16 ***
## siri 9.095e+14 2.898e+07 31378661 <2e-16 ***
## smart -3.270e+15 5.420e+07 -60333073 <2e-16 ***
## smartphon 1.776e+15 4.402e+07 40347795 <2e-16 ***
## someth -2.423e+15 4.485e+07 -54020029 <2e-16 ***
## soon -1.071e+15 5.329e+07 -20090147 <2e-16 ***
## stand 2.611e+14 4.622e+07 5649417 <2e-16 ***
## start 2.011e+15 3.619e+07 55555238 <2e-16 ***
## steve -7.724e+14 3.778e+07 -20444006 <2e-16 ***
## still 5.952e+14 2.562e+07 23233662 <2e-16 ***
## stop -1.169e+15 2.912e+07 -40151745 <2e-16 ***
## store 2.691e+14 1.566e+07 17178899 <2e-16 ***
## stuff 1.053e+15 3.041e+07 34630280 <2e-16 ***
## stupid 2.242e+15 3.776e+07 59378393 <2e-16 ***
## suck 3.312e+15 5.902e+07 56117215 <2e-16 ***
## support -3.927e+14 2.526e+07 -15545957 <2e-16 ***
## sure 5.276e+14 2.444e+07 21591174 <2e-16 ***
## switch 1.026e+15 3.713e+07 27636229 <2e-16 ***
## take 1.454e+15 3.049e+07 47676108 <2e-16 ***
## talk 1.160e+15 4.092e+07 28344673 <2e-16 ***
## team 5.116e+14 4.279e+07 11956277 <2e-16 ***
## tech 1.372e+14 2.826e+07 4855520 <2e-16 ***
## technolog 1.859e+15 4.789e+07 38828394 <2e-16 ***
## tell -4.407e+15 3.032e+07 -145319857 <2e-16 ***
## text -1.056e+15 2.855e+07 -37005811 <2e-16 ***
## thank 5.020e+14 1.624e+07 30920933 <2e-16 ***
## that 5.877e+14 2.743e+07 21425536 <2e-16 ***
## theyr 4.440e+14 4.140e+07 10725565 <2e-16 ***
## thing 1.314e+15 2.813e+07 46703319 <2e-16 ***
## think 5.194e+12 1.731e+07 300098 <2e-16 ***
## tho -1.177e+15 4.034e+07 -29171998 <2e-16 ***
## thought 1.410e+15 3.416e+07 41265489 <2e-16 ***
## time -1.168e+15 1.910e+07 -61131587 <2e-16 ***
## today -2.078e+15 3.284e+07 -63274599 <2e-16 ***
## togeth 7.002e+13 4.243e+07 1650277 <2e-16 ***
## touch -1.855e+15 3.590e+07 -51661762 <2e-16 ***
## touchid -7.105e+14 3.271e+07 -21718393 <2e-16 ***
## tri 3.939e+14 2.578e+07 15281257 <2e-16 ***
## true -1.195e+15 4.744e+07 -25195949 <2e-16 ***
## turn -6.904e+14 3.881e+07 -17786011 <2e-16 ***
## twitter -1.036e+15 2.130e+07 -48635138 <2e-16 ***
## two -3.955e+14 3.708e+07 -10663907 <2e-16 ***
## updat -8.976e+14 2.349e+07 -38215084 <2e-16 ***
## upgrad 7.140e+14 3.412e+07 20923641 <2e-16 ***
## use -6.561e+14 1.998e+07 -32838130 <2e-16 ***
## user -8.988e+14 4.875e+07 -18436597 <2e-16 ***
## via -1.984e+14 2.744e+07 -7231557 <2e-16 ***
## video 5.101e+14 3.166e+07 16109357 <2e-16 ***
## wait -5.983e+14 2.648e+07 -22596252 <2e-16 ***
## want 1.260e+15 2.132e+07 59111862 <2e-16 ***
## watch -2.470e+15 4.225e+07 -58471474 <2e-16 ***
## way 7.160e+14 2.652e+07 26997060 <2e-16 ***
## week -3.224e+14 3.010e+07 -10710831 <2e-16 ***
## well -1.755e+14 2.428e+07 -7228548 <2e-16 ***
## what -4.408e+13 3.505e+07 -1257553 <2e-16 ***
## white -1.392e+15 3.818e+07 -36452539 <2e-16 ***
## will -9.474e+14 1.517e+07 -62457098 <2e-16 ***
## windowsphon -3.886e+14 3.529e+07 -11011247 <2e-16 ***
## wish -7.643e+14 3.563e+07 -21453368 <2e-16 ***
## without 9.632e+15 1.229e+08 78375458 <2e-16 ***
## wonder -3.356e+15 4.465e+07 -75160731 <2e-16 ***
## wont 7.487e+14 2.439e+07 30700123 <2e-16 ***
## work -5.250e+14 2.149e+07 -24427395 <2e-16 ***
## world -9.070e+14 3.288e+07 -27588120 <2e-16 ***
## worst -6.848e+14 4.205e+07 -16286313 <2e-16 ***
## wow -2.659e+15 4.167e+07 -63801460 <2e-16 ***
## wtf 3.310e+15 3.388e+07 97702039 <2e-16 ***
## yall -1.190e+15 2.750e+07 -43266794 <2e-16 ***
## year 2.548e+15 3.637e+07 70059925 <2e-16 ***
## yes 1.047e+15 4.189e+07 24989769 <2e-16 ***
## yet -8.556e+14 3.875e+07 -22082028 <2e-16 ***
## yooo 2.500e+15 4.613e+07 54187843 <2e-16 ***
## your 1.399e+15 2.973e+07 47054552 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 708.98 on 825 degrees of freedom
## Residual deviance: 2955.58 on 523 degrees of freedom
## AIC: 3561.6
##
## Number of Fisher Scoring iterations: 25
predictlogit <- predict(tweetlogit, newdata = testSparse, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Because the overfitted model pushes every test-set prediction to (numerically) 0 or 1, tabulating the raw probabilities already yields a two-column confusion matrix; ordinarily we would threshold them first, e.g. predictlogit > 0.5.
confusionmatrix <- table(testSparse$negative, predictlogit)
confusionmatrix
## predictlogit
## 2.22044604925031e-16 1
## FALSE 253 47
## TRUE 27 28
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(testSparse))
##
## Accuracy 0.7915493
The accuracy, 0.79, is worse than the baseline of 0.845. The model fits the training set extremely well but performs poorly on the test set; this is overfitting. A logistic regression model with a large number of independent variables is particularly prone to overfitting. The warning messages from glm (failure to converge, fitted probabilities numerically 0 or 1) are symptoms of the same problem: with this many variables the model can fit the training set almost perfectly.
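A quick way to see the overfitting directly (a sketch, output not shown) is to compute the model's accuracy on its own training set, which should come out close to 100%:
predictTrainlogit <- predict(tweetlogit, type = "response")   # fitted probabilities on the training set
trainmatrix <- table(trainSparse$negative, predictTrainlogit > 0.5)
cat("\nTraining accuracy", sum(diag(trainmatrix)) / nrow(trainSparse))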