Source: Analytics Edge Unit 5 Lecture

Since we’re working with text data here, we need one extra argument when reading in the file, stringsAsFactors = FALSE, so that the tweets are read in as character strings rather than as factors.

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_5_Text_analytics")
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

We’re more interested in being able to detect the tweets with clear negative sentiment, so let’s define a new variable in our data set tweets called Negative, which is TRUE when the average sentiment score is -1 or lower and FALSE otherwise.

tweets$Negative <- as.factor(tweets$Avg <= -1)
table(tweets$Negative)
## 
## FALSE  TRUE 
##   999   182

Now to pre-process our text data so that we can use the bag of words approach, we’ll be using the tm text mining package.

library(tm)
## Loading required package: NLP
library(SnowballC)

One of the concepts introduced by the tm package is that of a corpus. A corpus is a collection of documents. We’ll need to convert our tweets to a corpus for pre-processing. tm can create a corpus in many different ways, but we’ll create it from the Tweet column of our data frame using two functions, Corpus and VectorSource.

corpus <- Corpus(VectorSource(tweets$Tweet))
corpus
## <<VCorpus (documents: 1181, metadata (corpus/indexed): 0/0)>>
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore

Preprocess the data

Let’s try it out by changing all of the text in our tweets to lowercase.

corpus <- tm_map(corpus, tolower)
corpus[[1]]
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"

Because tolower is a standard R function rather than a transformation from the tm package, newer versions of tm also require converting the corpus back to PlainTextDocument objects after using it:

corpus <- tm_map(corpus, PlainTextDocument)

Now let’s remove all punctuation.

corpus <- tm_map(corpus, removePunctuation)
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## i have to say apple has by far the best customer care service i have ever received apple appstore

Now we want to remove the stop words in our tweets.

tm provides a list of stop words for the English language. We can check it out by typing stopwords("english")[1:10]
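
For instance, the first ten entries look something like this (the exact contents and print formatting can vary slightly across tm versions):

stopwords("english")[1:10]
##  [1] "i"         "me"        "my"        "myself"    "we"        "our"      
##  [7] "ours"      "ourselves" "you"       "your"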

Removing words can be done with the removeWords argument to the tm_map function, but this time we need one extra argument: the stop words that we want to remove.

We’ll remove all of these English stop words, but we’ll also remove the word “apple” since all of these tweets have the word “apple” and it probably won’t be very useful in our prediction problem.

corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
##    say    far  best customer care service   ever received  appstore

Lastly, we want to stem our document with the stemDocument argument.

corpus <- tm_map(corpus, stemDocument)
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
##    say    far  best custom care servic   ever receiv  appstor

Bag of words in R

The tm package provides a function called DocumentTermMatrix that generates a matrix where the rows correspond to documents, in our case tweets, and the columns correspond to words in those tweets. The values in the matrix are the number of times that word appears in each document.

We can see that there are 3,289 terms or words in our matrix and 1,181 documents or tweets after preprocessing.

frequencies <- DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)

Let’s see what this matrix looks like using the inspect function. In this range we can see that the word “cheer” appears once in tweet 1005, but none of the other terms shown appear in these six tweets. (The document names print as character(0) because of the earlier PlainTextDocument conversion.)

inspect(frequencies[1000:1005, 505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##               Terms
## Docs           cheapen cheaper check cheep cheer cheerio cherylcol chief
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     1       0         0     0
##               Terms
## Docs           chiiiiqu child children
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0

We can look at the most frequent terms with the function findFreqTerms, giving the argument lowfreq, the minimum number of times a term must appear in order to be displayed.

findFreqTerms(frequencies, lowfreq = 20)
##  [1] "android"              "anyon"                "app"                 
##  [4] "appl"                 "back"                 "batteri"             
##  [7] "better"               "buy"                  "can"                 
## [10] "cant"                 "come"                 "dont"                
## [13] "fingerprint"          "freak"                "get"                 
## [16] "googl"                "ios7"                 "ipad"                
## [19] "iphon"                "iphone5"              "iphone5c"            
## [22] "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                 
## [28] "look"                 "love"                 "make"                
## [31] "market"               "microsoft"            "need"                
## [34] "new"                  "now"                  "one"                 
## [37] "phone"                "pleas"                "promo"               
## [40] "promoipodplayerpromo" "realli"               "releas"              
## [43] "samsung"              "say"                  "store"               
## [46] "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                 
## [52] "via"                  "want"                 "well"                
## [55] "will"                 "work"

So out of the 3,289 words in our matrix, only 56 words appear at least 20 times in our tweets. This means that we probably have a lot of terms that will be pretty useless for our prediction model.

The number of terms is an issue for two main reasons. One is computational: more terms means more independent variables, which usually means it takes longer to build our models. The other is statistical: as we mentioned before, the ratio of independent variables to observations affects how well a model will generalize. So let’s remove some terms that don’t appear very often.

sparse <- removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)

Note that the second argument is the sparsity threshold, which works as follows. If we say 0.98, that means to only keep terms that appear in 2% or more of the tweets; if we say 0.99, only terms that appear in 1% or more of the tweets. Our 0.995 keeps terms that appear in 0.5% or more of the tweets, i.e. in at least six of our 1,181 tweets.
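
As a quick sanity check on the threshold, you can compare how many terms survive at a few different values; each call below prints a DocumentTermMatrix summary that includes its term count:

removeSparseTerms(frequencies, 0.98)   # keep terms in at least 2% of tweets
removeSparseTerms(frequencies, 0.99)   # keep terms in at least 1% of tweets
removeSparseTerms(frequencies, 0.995)  # keep terms in at least 0.5% of tweets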

Now let’s convert the sparse matrix into a data frame that we’ll be able to use for our predictive models.

tweetsSparse <- as.data.frame(as.matrix(sparse))

Since R struggles with variable names that start with a number, and we probably have some words here that start with a number, let’s run the make.names function to make sure all of our words are appropriate variable names.
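
As a small illustration, make.names prepends an X to any name that starts with a digit; the two Twitter-handle terms below actually occur in our matrix, and you’ll see their X-prefixed versions in the logistic regression summary later:

make.names(c("7evenstarz", "244tsuyoponzu"))
## [1] "X7evenstarz"    "X244tsuyoponzu"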

Then let’s add our dependent variable to this data set.

colnames(tweetsSparse) <- make.names(colnames(tweetsSparse))
tweetsSparse$Negative <- tweets$Negative

Split the dataset

library(caTools)
set.seed(123)
split <- sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparse <- subset(tweetsSparse, split == TRUE)
testSparse <- subset(tweetsSparse, split == FALSE)

Predicting sentiment

Let’s use CART to build a predictive model. We’re just using the default parameter settings, so we won’t add anything for minbucket or cp.

library(rpart)
library(rpart.plot)
tweetCART <- rpart(Negative ~., data = trainSparse, method = "class")
prp(tweetCART)

Our tree says that if the word “freak” is in the tweet, then predict TRUE, or negative sentiment. If the word “freak” is not in the tweet, but the word “hate” is, again predict TRUE.

Make predictions

predictCART <- predict(tweetCART, newdata = testSparse, type = "class")
table(testSparse$Negative, predictCART)
##        predictCART
##         FALSE TRUE
##   FALSE   294    6
##   TRUE     37   18

The model accuracy is about 0.879:

(294+18)/nrow(testSparse)
## [1] 0.8788732

Baseline model

table(testSparse$Negative)
## 
## FALSE  TRUE 
##   300    55

The baseline model predicts the most frequent outcome, non-negative (FALSE), for every tweet, so its accuracy is about 0.845:

300/nrow(testSparse)
## [1] 0.8450704

So our CART model does better than the simple baseline model. How about a random forest model?

We’ll again use the default parameter settings.

library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
tweetRF <- randomForest(Negative ~., data = trainSparse)

The random forest model takes significantly longer to build than the CART model. We’ve seen this before when building CART and random forest models, but in this case, the difference is particularly drastic. This is because we have so many independent variables, about 300 different words. So far in this course, we haven’t seen data sets with this many independent variables.
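
If you want to quantify the difference on your own machine, a small sketch is to wrap each model call in system.time (absolute timings will vary by hardware):

# Elapsed time to build each model
system.time(rpart(Negative ~ ., data = trainSparse, method = "class"))
system.time(randomForest(Negative ~ ., data = trainSparse))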

Make predictions

predictRF <- predict(tweetRF, newdata = testSparse)
table(testSparse$Negative, predictRF)
##        predictRF
##         FALSE TRUE
##   FALSE   293    7
##   TRUE     34   21

The model accuracy is about 0.8845:

(293+21)/nrow(testSparse)
## [1] 0.8845070

This is a little better than our CART model, but because the CART model is interpretable, I’d probably prefer it over the random forest model.

If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would increase to about the same as the random forest model.
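
Here is a minimal sketch of that cross-validation, assuming the caret and e1071 packages are installed; the cp grid below is illustrative rather than taken from the lecture:

library(caret)
library(e1071)
set.seed(123)
# 10-fold cross-validation over a grid of cp values
numFolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(.cp = seq(0.01, 0.5, 0.01))
train(Negative ~ ., data = trainSparse, method = "rpart",
      trControl = numFolds, tuneGrid = cpGrid)

train reports the cp value with the best cross-validated accuracy; you would then refit rpart with that cp and re-evaluate on testSparse.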

Build a logistic regression model

tweetLog <- glm(Negative ~., data = trainSparse, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(tweetLog)
## 
## Call:
## glm(formula = Negative ~ ., family = binomial, data = trainSparse)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -8.49    0.00    0.00    0.00    8.49  
## 
## Coefficients: (7 not defined because of singularities)
##                        Estimate Std. Error    z value Pr(>|z|)    
## (Intercept)          -9.920e+14  5.330e+06 -186121946   <2e-16 ***
## X244tsuyoponzu       -3.512e+15  3.048e+07 -115204706   <2e-16 ***
## X7evenstarz           3.320e+15  4.542e+07   73106681   <2e-16 ***
## actual               -6.510e+14  2.884e+07  -22576632   <2e-16 ***
## add                  -2.485e+14  6.062e+07   -4098491   <2e-16 ***
## alreadi              -4.332e+14  3.772e+07  -11484882   <2e-16 ***
## alway                -2.046e+15  4.813e+07  -42509986   <2e-16 ***
## amaz                 -9.769e+14  3.977e+07  -24563922   <2e-16 ***
## amazon               -4.872e+15  9.950e+07  -48965124   <2e-16 ***
## android               3.365e+14  2.007e+07   16760975   <2e-16 ***
## announc              -1.801e+15  2.809e+07  -64113050   <2e-16 ***
## anyon                -6.552e+14  2.607e+07  -25128832   <2e-16 ***
## app                   2.738e+13  1.556e+07    1758907   <2e-16 ***
## appl                 -6.912e+14  1.695e+07  -40772055   <2e-16 ***
## appstor              -2.524e+15  3.697e+07  -68257124   <2e-16 ***
## arent                -7.446e+14  4.287e+07  -17367903   <2e-16 ***
## ask                  -6.210e+14  3.861e+07  -16084279   <2e-16 ***
## avail                -1.875e+15  3.782e+07  -49577225   <2e-16 ***
## away                  7.154e+14  4.471e+07   15999994   <2e-16 ***
## awesom                6.460e+14  3.464e+07   18651081   <2e-16 ***
## back                 -1.553e+15  2.304e+07  -67415341   <2e-16 ***
## batteri              -1.241e+15  3.029e+07  -40971465   <2e-16 ***
## best                  1.381e+15  3.759e+07   36735272   <2e-16 ***
## better                1.045e+15  2.169e+07   48188751   <2e-16 ***
## big                  -3.075e+15  3.069e+07 -100201087   <2e-16 ***
## bit                  -5.128e+13  2.799e+07   -1832211   <2e-16 ***
## black                 1.791e+15  3.684e+07   48610445   <2e-16 ***
## blackberri           -2.497e+15  3.251e+07  -76822782   <2e-16 ***
## break.               -1.529e+14  3.619e+07   -4224898   <2e-16 ***
## bring                -1.381e+15  3.576e+07  -38613216   <2e-16 ***
## burberri             -8.594e+14  3.593e+07  -23917714   <2e-16 ***
## busi                 -3.389e+15  3.130e+07 -108267519   <2e-16 ***
## buy                   1.431e+15  2.469e+07   57973197   <2e-16 ***
## call                  1.734e+15  3.203e+07   54125191   <2e-16 ***
## can                  -6.691e+14  1.522e+07  -43949600   <2e-16 ***
## cant                  1.008e+15  2.174e+07   46350947   <2e-16 ***
## carbon               -4.176e+15  1.194e+08  -34968909   <2e-16 ***
## card                  5.505e+15  5.740e+07   95897290   <2e-16 ***
## care                  2.252e+15  3.773e+07   59687045   <2e-16 ***
## case                  5.282e+14  3.509e+07   15051515   <2e-16 ***
## cdp                   2.523e+15  8.722e+07   28927566   <2e-16 ***
## chang                -5.299e+14  3.216e+07  -16477376   <2e-16 ***
## charg                -1.951e+14  4.422e+07   -4411679   <2e-16 ***
## charger               1.186e+15  2.806e+07   42286065   <2e-16 ***
## cheap                -6.602e+14  3.328e+07  -19836513   <2e-16 ***
## china                 8.324e+14  2.605e+07   31955599   <2e-16 ***
## color                -2.420e+15  2.638e+07  -91741761   <2e-16 ***
## colour                2.046e+15  5.176e+07   39521936   <2e-16 ***
## come                 -3.153e+14  1.973e+07  -15975480   <2e-16 ***
## compani              -7.323e+14  2.785e+07  -26296035   <2e-16 ***
## condescens           -1.414e+14  6.016e+07   -2349754   <2e-16 ***
## condom                9.909e+14  4.130e+07   23993539   <2e-16 ***
## copi                 -4.908e+14  4.424e+07  -11092514   <2e-16 ***
## crack                -2.371e+14  4.425e+07   -5358695   <2e-16 ***
## creat                 6.822e+14  3.639e+07   18748473   <2e-16 ***
## custom                5.266e+14  4.341e+07   12131948   <2e-16 ***
## darn                  9.610e+14  3.199e+07   30045163   <2e-16 ***
## data                  7.456e+14  3.652e+07   20418729   <2e-16 ***
## date                 -6.981e+14  4.424e+07  -15779974   <2e-16 ***
## day                   1.028e+15  2.777e+07   37036060   <2e-16 ***
## dear                  2.592e+14  3.169e+07    8180035   <2e-16 ***
## design               -9.111e+14  3.663e+07  -24872411   <2e-16 ***
## develop              -8.324e+14  2.670e+07  -31175032   <2e-16 ***
## devic                -1.339e+15  2.559e+07  -52321213   <2e-16 ***
## didnt                 1.604e+15  3.518e+07   45595606   <2e-16 ***
## die                   1.491e+14  3.535e+07    4218446   <2e-16 ***
## differ                1.794e+15  3.527e+07   50872052   <2e-16 ***
## disappoint            3.634e+15  3.989e+07   91103769   <2e-16 ***
## discontinu            5.583e+15  1.268e+08   44010243   <2e-16 ***
## divulg                1.254e+16  2.769e+08   45283069   <2e-16 ***
## doesnt               -7.841e+14  3.710e+07  -21138693   <2e-16 ***
## done                 -7.802e+14  3.079e+07  -25337588   <2e-16 ***
## dont                 -2.492e+14  1.925e+07  -12944299   <2e-16 ***
## download              1.767e+15  4.755e+07   37147973   <2e-16 ***
## drop                 -8.428e+14  3.445e+07  -24466250   <2e-16 ***
## email                -1.539e+15  4.057e+07  -37937950   <2e-16 ***
## emiss                 3.488e+15  1.884e+08   18512319   <2e-16 ***
## emoji                 1.155e+14  3.459e+07    3337948   <2e-16 ***
## even                  1.389e+15  3.862e+07   35965471   <2e-16 ***
## event                 5.767e+13  3.959e+07    1456666   <2e-16 ***
## ever                  8.057e+14  4.475e+07   18003760   <2e-16 ***
## everi                 9.798e+14  2.653e+07   36936978   <2e-16 ***
## everyth              -1.428e+15  6.367e+07  -22431131   <2e-16 ***
## facebook              4.670e+15  9.430e+07   49518767   <2e-16 ***
## fail                 -2.134e+15  3.510e+07  -60791853   <2e-16 ***
## featur               -1.217e+15  3.656e+07  -33293867   <2e-16 ***
## feel                  7.390e+14  3.921e+07   18847453   <2e-16 ***
## femal                        NA         NA         NA       NA    
## figur                -2.093e+15  4.623e+07  -45267455   <2e-16 ***
## final                -3.916e+15  5.950e+07  -65813277   <2e-16 ***
## finger               -2.286e+15  3.418e+07  -66875216   <2e-16 ***
## fingerprint           6.267e+14  2.298e+07   27267211   <2e-16 ***
## fire                  2.666e+15  4.933e+07   54035887   <2e-16 ***
## first                 2.272e+15  4.033e+07   56339922   <2e-16 ***
## fix                  -9.437e+14  3.280e+07  -28771572   <2e-16 ***
## follow                3.692e+13  2.524e+07    1462504   <2e-16 ***
## freak                 2.097e+15  1.269e+07  165225698   <2e-16 ***
## free                  2.114e+15  3.469e+07   60939553   <2e-16 ***
## fun                  -1.567e+15  3.393e+07  -46201740   <2e-16 ***
## generat               1.852e+15  4.350e+07   42573374   <2e-16 ***
## genius               -2.628e+15  5.186e+07  -50667479   <2e-16 ***
## get                  -6.566e+13  1.286e+07   -5106639   <2e-16 ***
## give                 -3.812e+14  2.563e+07  -14870312   <2e-16 ***
## gold                 -1.559e+15  2.879e+07  -54155667   <2e-16 ***
## gonna                -1.788e+15  3.398e+07  -52618673   <2e-16 ***
## good                  1.136e+15  3.393e+07   33471389   <2e-16 ***
## googl                 1.467e+14  2.303e+07    6369817   <2e-16 ***
## got                   4.316e+13  3.399e+07    1269862   <2e-16 ***
## great                -2.490e+15  4.510e+07  -55196981   <2e-16 ***
## guess                -1.447e+15  4.829e+07  -29963925   <2e-16 ***
## guy                  -1.888e+15  3.129e+07  -60353010   <2e-16 ***
## happen               -4.466e+14  4.406e+07  -10135592   <2e-16 ***
## happi                 3.524e+13  4.359e+07     808339   <2e-16 ***
## hate                  2.864e+15  2.513e+07  113942314   <2e-16 ***
## help                 -2.217e+15  2.513e+07  -88221557   <2e-16 ***
## hey                  -7.702e+14  2.554e+07  -30158081   <2e-16 ***
## hope                  1.540e+15  3.280e+07   46963460   <2e-16 ***
## hour                 -1.241e+15  3.524e+07  -35216396   <2e-16 ***
## httpbitly18xc8dk     -3.519e+15  9.701e+07  -36270736   <2e-16 ***
## ibrooklynb           -7.461e+14  3.801e+07  -19631078   <2e-16 ***
## idea                 -3.151e+14  4.436e+07   -7104439   <2e-16 ***
## ill                   1.592e+15  5.577e+07   28550971   <2e-16 ***
## imessag               2.580e+15  3.977e+07   64866978   <2e-16 ***
## impress              -1.585e+15  3.082e+07  -51432568   <2e-16 ***
## improv                2.459e+15  3.456e+07   71148844   <2e-16 ***
## innov                -2.897e+14  2.963e+07   -9779021   <2e-16 ***
## instead              -9.517e+14  4.377e+07  -21741191   <2e-16 ***
## internet              3.517e+15  4.725e+07   74424141   <2e-16 ***
## ios7                  1.077e+15  2.211e+07   48720725   <2e-16 ***
## ipad                  3.842e+14  1.632e+07   23536437   <2e-16 ***
## iphon                 4.034e+14  7.159e+06   56348498   <2e-16 ***
## iphone4               8.370e+14  4.920e+07   17011644   <2e-16 ***
## iphone5              -2.650e+14  1.469e+07  -18039241   <2e-16 ***
## iphone5c             -4.317e+14  1.722e+07  -25076039   <2e-16 ***
## iphoto               -3.675e+15  8.144e+07  -45124501   <2e-16 ***
## ipod                 -2.373e+14  4.168e+07   -5693496   <2e-16 ***
## ipodplayerpromo       6.606e+15  7.369e+07   89647465   <2e-16 ***
## isnt                 -1.441e+15  3.342e+07  -43103965   <2e-16 ***
## itun                  7.812e+14  2.475e+07   31560052   <2e-16 ***
## ive                  -1.262e+15  4.976e+07  -25356172   <2e-16 ***
## job                   2.602e+15  5.409e+07   48117038   <2e-16 ***
## just                  8.734e+13  1.485e+07    5882622   <2e-16 ***
## keynot               -1.334e+15  3.632e+07  -36721219   <2e-16 ***
## know                 -4.863e+14  2.869e+07  -16949842   <2e-16 ***
## last                 -1.376e+15  4.573e+07  -30084509   <2e-16 ***
## launch               -1.608e+15  4.400e+07  -36542265   <2e-16 ***
## let                  -1.004e+15  3.636e+07  -27622165   <2e-16 ***
## life                  1.954e+15  3.133e+07   62366326   <2e-16 ***
## like                  5.764e+14  1.380e+07   41765284   <2e-16 ***
## line                  1.388e+15  5.901e+07   23515558   <2e-16 ***
## lmao                 -1.409e+15  4.065e+07  -34661281   <2e-16 ***
## lock                 -2.337e+15  4.315e+07  -54161302   <2e-16 ***
## lol                  -6.485e+14  2.226e+07  -29127812   <2e-16 ***
## look                  2.656e+14  1.927e+07   13782493   <2e-16 ***
## los                  -9.306e+14  6.879e+07  -13528659   <2e-16 ***
## lost                 -6.318e+14  4.914e+07  -12858075   <2e-16 ***
## love                 -1.666e+15  2.229e+07  -74734388   <2e-16 ***
## mac                   3.766e+14  2.872e+07   13113385   <2e-16 ***
## macbook              -9.998e+14  3.650e+07  -27388234   <2e-16 ***
## made                 -8.935e+14  3.428e+07  -26062597   <2e-16 ***
## make                 -6.734e+14  1.544e+07  -43616051   <2e-16 ***
## man                  -9.857e+14  3.996e+07  -24666737   <2e-16 ***
## mani                  1.215e+15  4.655e+07   26103391   <2e-16 ***
## market                2.390e+15  2.693e+07   88733398   <2e-16 ***
## mayb                 -3.745e+14  3.961e+07   -9455839   <2e-16 ***
## mean                 -1.588e+15  4.017e+07  -39534427   <2e-16 ***
## microsoft            -3.889e+14  2.108e+07  -18452324   <2e-16 ***
## mishiza              -1.820e+15  3.955e+07  -46005474   <2e-16 ***
## miss                  7.058e+14  4.084e+07   17282812   <2e-16 ***
## mobil                -1.577e+15  2.734e+07  -57687714   <2e-16 ***
## money                 1.905e+15  6.131e+07   31068521   <2e-16 ***
## motorola              2.762e+14  4.287e+07    6441942   <2e-16 ***
## move                  8.100e+14  5.308e+07   15260277   <2e-16 ***
## much                  8.625e+14  3.177e+07   27149904   <2e-16 ***
## music                -2.047e+15  6.402e+07  -31972989   <2e-16 ***
## natz0711                     NA         NA         NA       NA    
## need                 -1.031e+15  1.821e+07  -56634219   <2e-16 ***
## never                -1.312e+15  3.783e+07  -34672497   <2e-16 ***
## new                  -1.944e+14  1.066e+07  -18233735   <2e-16 ***
## news                 -4.306e+14  3.033e+07  -14198849   <2e-16 ***
## next.                 8.719e+13  2.615e+07    3334199   <2e-16 ***
## nfc                  -3.264e+15  4.427e+07  -73747737   <2e-16 ***
## nokia                 4.489e+14  2.443e+07   18373562   <2e-16 ***
## noth                  5.356e+14  4.374e+07   12245957   <2e-16 ***
## now                  -2.905e+14  1.603e+07  -18118583   <2e-16 ***
## nsa                   1.054e+15  3.148e+07   33483727   <2e-16 ***
## nuevo                -1.511e+15  6.661e+07  -22686533   <2e-16 ***
## offer                -2.593e+15  3.757e+07  -69009890   <2e-16 ***
## old                   3.533e+14  3.002e+07   11768190   <2e-16 ***
## one                  -1.058e+15  1.629e+07  -64937198   <2e-16 ***
## page                 -2.668e+15  4.623e+07  -57713378   <2e-16 ***
## para                 -2.153e+13  3.057e+07    -704422   <2e-16 ***
## peopl                 5.416e+14  2.695e+07   20094078   <2e-16 ***
## perfect              -2.438e+15  6.262e+07  -38931998   <2e-16 ***
## person                2.290e+15  4.550e+07   50334918   <2e-16 ***
## phone                -7.809e+13  1.304e+07   -5986305   <2e-16 ***
## photog                       NA         NA         NA       NA    
## photographi                  NA         NA         NA       NA    
## pictur                7.211e+14  3.441e+07   20955765   <2e-16 ***
## plastic               9.234e+14  3.327e+07   27755114   <2e-16 ***
## play                 -1.052e+15  4.390e+07  -23965058   <2e-16 ***
## pleas                -5.337e+14  2.295e+07  -23251584   <2e-16 ***
## ppl                  -1.572e+14  2.959e+07   -5312932   <2e-16 ***
## preorder             -2.024e+15  2.873e+07  -70462512   <2e-16 ***
## price                -9.421e+14  2.412e+07  -39068684   <2e-16 ***
## print                -2.838e+15  4.089e+07  -69403665   <2e-16 ***
## pro                  -9.754e+14  6.541e+07  -14911571   <2e-16 ***
## problem               1.375e+15  3.855e+07   35669826   <2e-16 ***
## product               4.819e+14  3.426e+07   14068332   <2e-16 ***
## promo                -3.915e+15  4.806e+07  -81457922   <2e-16 ***
## promoipodplayerpromo -1.094e+16  7.035e+07 -155553600   <2e-16 ***
## put                  -9.324e+14  3.625e+07  -25723463   <2e-16 ***
## que                  -1.480e+15  2.548e+07  -58060104   <2e-16 ***
## quiet                        NA         NA         NA       NA    
## read                  4.265e+14  8.949e+07    4765958   <2e-16 ***
## realli               -1.509e+15  1.961e+07  -76933878   <2e-16 ***
## recommend                    NA         NA         NA       NA    
## refus                -1.072e+16  2.319e+08  -46252121   <2e-16 ***
## releas               -1.722e+15  2.651e+07  -64967621   <2e-16 ***
## right                -2.669e+15  3.356e+07  -79508547   <2e-16 ***
## said                 -2.922e+14  4.600e+07   -6351751   <2e-16 ***
## samsung              -1.242e+15  2.135e+07  -58197926   <2e-16 ***
## samsungsa                    NA         NA         NA       NA    
## say                  -1.500e+15  2.609e+07  -57480690   <2e-16 ***
## scanner              -8.666e+14  3.764e+07  -23020957   <2e-16 ***
## screen                2.259e+15  3.255e+07   69383914   <2e-16 ***
## secur                 6.985e+12  4.003e+07     174475   <2e-16 ***
## see                  -1.623e+15  3.483e+07  -46614830   <2e-16 ***
## seem                 -9.348e+12  3.886e+07    -240535   <2e-16 ***
## sell                  2.744e+14  3.400e+07    8071396   <2e-16 ***
## send                 -3.054e+15  4.526e+07  -67467247   <2e-16 ***
## servic               -2.318e+15  3.904e+07  -59374031   <2e-16 ***
## shame                 5.907e+15  8.598e+07   68703308   <2e-16 ***
## share                 2.398e+14  3.103e+07    7727118   <2e-16 ***
## short                 2.156e+15  4.595e+07   46911907   <2e-16 ***
## show                 -1.467e+15  4.690e+07  -31288005   <2e-16 ***
## simpl                -4.349e+13  4.840e+07    -898616   <2e-16 ***
## sinc                 -1.113e+15  3.599e+07  -30920970   <2e-16 ***
## siri                  9.395e+14  2.898e+07   32413223   <2e-16 ***
## smart                -3.077e+15  5.420e+07  -56761341   <2e-16 ***
## smartphon             2.800e+15  4.402e+07   63611401   <2e-16 ***
## someth               -1.560e+15  4.485e+07  -34790507   <2e-16 ***
## soon                 -1.672e+14  5.329e+07   -3136755   <2e-16 ***
## stand                 2.732e+14  4.622e+07    5910391   <2e-16 ***
## start                 2.123e+15  3.619e+07   58667153   <2e-16 ***
## steve                -4.293e+14  3.778e+07  -11364073   <2e-16 ***
## still                 7.638e+14  2.562e+07   29817436   <2e-16 ***
## stop                 -1.240e+15  2.912e+07  -42578608   <2e-16 ***
## store                 1.864e+14  1.566e+07   11900466   <2e-16 ***
## stuff                 4.245e+14  3.041e+07   13960403   <2e-16 ***
## stupid                2.763e+15  3.776e+07   73172248   <2e-16 ***
## suck                  2.599e+15  5.902e+07   44037893   <2e-16 ***
## support              -9.165e+14  2.526e+07  -36281575   <2e-16 ***
## sure                  6.069e+14  2.444e+07   24836843   <2e-16 ***
## switch                1.249e+15  3.713e+07   33644639   <2e-16 ***
## take                  1.724e+15  3.049e+07   56537983   <2e-16 ***
## talk                  1.025e+15  4.092e+07   25040996   <2e-16 ***
## team                  1.050e+15  4.279e+07   24538170   <2e-16 ***
## tech                 -2.888e+14  2.826e+07  -10217535   <2e-16 ***
## technolog             1.007e+15  4.789e+07   21037368   <2e-16 ***
## tell                 -5.451e+15  3.032e+07 -179775002   <2e-16 ***
## text                 -1.379e+15  2.855e+07  -48298573   <2e-16 ***
## thank                 5.329e+14  1.624e+07   32822779   <2e-16 ***
## that                  1.118e+15  2.743e+07   40760352   <2e-16 ***
## theyr                 2.258e+15  4.140e+07   54550146   <2e-16 ***
## thing                 1.728e+14  2.813e+07    6144387   <2e-16 ***
## think                 7.068e+13  1.731e+07    4083295   <2e-16 ***
## tho                  -1.151e+15  4.034e+07  -28540721   <2e-16 ***
## thought               1.268e+15  3.416e+07   37105675   <2e-16 ***
## time                 -1.047e+15  1.910e+07  -54839450   <2e-16 ***
## today                -1.667e+15  3.284e+07  -50750509   <2e-16 ***
## togeth               -2.722e+14  4.243e+07   -6414600   <2e-16 ***
## touch                -2.081e+15  3.590e+07  -57972514   <2e-16 ***
## touchid              -5.763e+14  3.271e+07  -17618155   <2e-16 ***
## tri                  -4.530e+13  2.578e+07   -1757417   <2e-16 ***
## true                  3.177e+14  4.744e+07    6697083   <2e-16 ***
## turn                 -2.118e+14  3.881e+07   -5456837   <2e-16 ***
## twitter              -1.261e+15  2.130e+07  -59190038   <2e-16 ***
## two                   1.137e+15  3.708e+07   30660787   <2e-16 ***
## updat                -1.147e+15  2.349e+07  -48832312   <2e-16 ***
## upgrad                1.557e+15  3.412e+07   45634466   <2e-16 ***
## use                  -3.404e+14  1.998e+07  -17038738   <2e-16 ***
## user                 -1.068e+15  4.875e+07  -21900372   <2e-16 ***
## via                  -2.463e+14  2.744e+07   -8974367   <2e-16 ***
## video                -1.236e+13  3.166e+07    -390310   <2e-16 ***
## wait                 -4.773e+12  2.648e+07    -180255   <2e-16 ***
## want                 -2.310e+14  2.132e+07  -10835058   <2e-16 ***
## watch                -1.968e+15  4.225e+07  -46582416   <2e-16 ***
## way                   7.861e+14  2.652e+07   29641136   <2e-16 ***
## week                 -6.609e+14  3.010e+07  -21955268   <2e-16 ***
## well                 -2.519e+14  2.428e+07  -10377566   <2e-16 ***
## what                 -1.850e+15  3.505e+07  -52780776   <2e-16 ***
## white                -2.057e+15  3.818e+07  -53871109   <2e-16 ***
## will                 -1.038e+15  1.517e+07  -68429967   <2e-16 ***
## windowsphon          -4.693e+14  3.529e+07  -13297779   <2e-16 ***
## wish                  2.394e+14  3.563e+07    6719950   <2e-16 ***
## without               1.722e+15  1.229e+08   14007378   <2e-16 ***
## wonder               -3.056e+15  4.465e+07  -68436414   <2e-16 ***
## wont                  1.125e+15  2.439e+07   46131994   <2e-16 ***
## work                 -6.739e+14  2.149e+07  -31352752   <2e-16 ***
## world                -1.100e+15  3.288e+07  -33464853   <2e-16 ***
## worst                -1.570e+15  4.205e+07  -37326100   <2e-16 ***
## wow                  -3.132e+15  4.167e+07  -75154670   <2e-16 ***
## wtf                   3.893e+15  3.388e+07  114906122   <2e-16 ***
## yall                 -1.082e+15  2.750e+07  -39331174   <2e-16 ***
## year                  3.767e+15  3.637e+07  103575648   <2e-16 ***
## yes                   1.477e+15  4.189e+07   35246920   <2e-16 ***
## yet                  -1.509e+15  3.875e+07  -38942395   <2e-16 ***
## yooo                  2.243e+15  4.613e+07   48624264   <2e-16 ***
## your                  1.373e+15  2.973e+07   46175370   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance:  708.98  on 825  degrees of freedom
## Residual deviance: 2523.06  on 523  degrees of freedom
## AIC: 3129.1
## 
## Number of Fisher Scoring iterations: 25

Make predictions

predictions <- predict(tweetLog, newdata = testSparse, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
table(testSparse$Negative, predictions >= 0.5)
##        
##         FALSE TRUE
##   FALSE   253   47
##   TRUE     22   33

The model accuracy is about 0.806:

(253+33)/nrow(testSparse)
## [1] 0.8056338

Our logistic regression model is even worse than the baseline model. If you were to compute the accuracy on the training set instead, you would see that the model does really well there. This is an example of overfitting: the model fits the training set very well but does not perform well on the test set. A logistic regression model with a large number of variables is particularly at risk for overfitting.
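
To see the overfitting for yourself, here is a quick sketch that builds the confusion matrix on the training set (predictTrainLog is a name introduced here for illustration):

# With no newdata argument, predict returns fitted values on the training set
predictTrainLog <- predict(tweetLog, type = "response")
table(trainSparse$Negative, predictTrainLog >= 0.5)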

Note that you might have gotten a different answer than us, because the glm function struggles with this many variables. The warning messages you might have seen have to do with the number of variables and with the fact that the model is overfitting the training set.