Introduction

This report uses the Yelp hygiene dataset to predict whether a restaurant will pass its public health inspection. The dataset is composed of a training subset of 546 restaurants, used to train the classifier, and a testing subset of 12,753 restaurants, used to evaluate the classifier's performance. The dataset is spread across three files such that the first 546 lines in each file correspond to the training subset and the rest belong to the testing subset. Below is a description of each file:
- hygiene.dat: Each line contains the concatenated text reviews of one restaurant.
- hygiene.dat.labels: For the first 546 lines, a binary label (0 or 1) is used where a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test. The rest of the lines have “[None]” in their label field implying that they are part of the testing subset.
- hygiene.dat.additional: It is a CSV (Comma-Separated Values) file where the first value is a list containing the cuisines offered, the second value is the zip code, which gives an idea about the location, the third is the number of reviews, and the fourth is the average rating, which can vary between 0 and 5 (5 being the best).
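
As a quick orientation, the training labels can be read and tabulated roughly as follows (a minimal sketch; the variable names are assumptions, and only the first 546 labeled lines are kept):

hygiene.labels.raw <- readLines("hygiene.dat.labels")
hygiene.labels <- as.integer(hygiene.labels.raw[1:546])   # 0 = passed, 1 = failed
table(hygiene.labels)                                     # check the class balance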

Data Exploration and Cleanup

As a first step, we will read all the reviews from hygiene.dat into a data frame and create a term-document matrix for each restaurant. Since the training dataset consists of only 546 restaurants, we will use only the first 546 restaurant reviews from hygiene.dat to train our model. We will explore multiple models, first using unigrams from the term-document matrix as features, and then using bigrams. Before extracting the unigram and bigram features, we will clean up the reviews using the tm package: we convert all text to lowercase, remove profanity, remove stopwords, and apply stemming. We also remove sparse terms, allowing at most 10% sparseness in the term-document matrix.
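
The cleanup might look roughly like the sketch below using the tm package; the exact transformations, the profanity word list (profanity.words is hypothetical), and the variable names are assumptions based on the description above:

library(tm)

reviews <- readLines("hygiene.dat")
corpus  <- VCorpus(VectorSource(reviews))
corpus  <- tm_map(corpus, content_transformer(tolower))
corpus  <- tm_map(corpus, removePunctuation)
corpus  <- tm_map(corpus, removeNumbers)
corpus  <- tm_map(corpus, removeWords, profanity.words)       # hypothetical profanity word list
corpus  <- tm_map(corpus, removeWords, stopwords("english"))
corpus  <- tm_map(corpus, stemDocument)
corpus  <- tm_map(corpus, stripWhitespace)
dtm     <- DocumentTermMatrix(corpus)
dtm     <- removeSparseTerms(dtm, 0.10)                        # allow at most 10% sparseness
hygiene.unigramDF <- as.data.frame(as.matrix(dtm[1:546, ]))    # training restaurants only
unigramtestData   <- as.data.frame(as.matrix(dtm[-(1:546), ]))

For the bigram matrix, the same corpus could be passed to DocumentTermMatrix with a two-gram tokenizer (for example RWeka::NGramTokenizer) supplied via the control argument.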

We have 184 unigram features and 57 bigram features.

Unigram Feature Set

colnames(hygiene.unigramDF)
##   [1] "actually"   "almost"     "also"       "always"     "amazing"   
##   [6] "amp"        "another"    "anything"   "area"       "around"    
##  [11] "atmosphere" "away"       "awesome"    "back"       "bad"       
##  [16] "bar"        "best"       "better"     "big"        "bit"       
##  [21] "came"       "can"        "cant"       "cheese"     "chicken"   
##  [26] "come"       "couple"     "day"        "decent"     "definitely"
##  [31] "delicious"  "didnt"      "different"  "dinner"     "dish"      
##  [36] "dishes"     "dont"       "drink"      "eat"        "eating"    
##  [41] "else"       "enough"     "especially" "even"       "ever"      
##  [46] "every"      "everything" "excellent"  "experience" "favorite"  
##  [51] "feel"       "find"       "first"      "flavor"     "food"      
##  [56] "found"      "fresh"      "friend"     "friendly"   "friends"   
##  [61] "full"       "get"        "give"       "go"         "going"     
##  [66] "good"       "got"        "great"      "happy"      "hard"      
##  [71] "home"       "hot"        "however"    "huge"       "id"        
##  [76] "ill"        "im"         "isnt"       "ive"        "just"      
##  [81] "kind"       "know"       "large"      "last"       "least"     
##  [86] "like"       "little"     "long"       "looking"    "lot"       
##  [91] "love"       "lunch"      "made"       "make"       "makes"     
##  [96] "many"       "maybe"      "meal"       "meat"       "menu"      
## [101] "might"      "minutes"    "much"       "need"       "never"     
## [106] "new"        "next"       "nice"       "night"      "nothing"   
## [111] "now"        "oh"         "ok"         "one"        "order"     
## [116] "ordered"    "people"     "perfect"    "place"      "places"    
## [121] "pretty"     "price"      "prices"     "probably"   "quality"   
## [126] "quick"      "quite"      "really"     "recommend"  "restaurant"
## [131] "right"      "said"       "salad"      "sauce"      "say"       
## [136] "seattle"    "see"        "served"     "service"    "side"      
## [141] "since"      "small"      "something"  "special"    "spot"      
## [146] "staff"      "stars"      "still"      "super"      "sure"      
## [151] "sweet"      "table"      "take"       "taste"      "tasty"     
## [156] "thats"      "thing"      "things"     "think"      "though"    
## [161] "thought"    "time"       "times"      "took"       "top"       
## [166] "tried"      "try"        "two"        "us"         "used"      
## [171] "usually"    "wait"       "want"       "wanted"     "wasnt"     
## [176] "way"        "well"       "went"       "will"       "without"   
## [181] "work"       "worth"      "years"      "youre"

Bigram Feature Set

colnames(hygiene.bigramDF)
##  [1] "back try"         "best ive"         "can get"         
##  [4] "cant wait"        "come back"        "coming back"     
##  [7] "dont get"         "dont know"        "dont like"       
## [10] "dont think"       "even though"      "every time"      
## [13] "feel like"        "felt like"        "first time"      
## [16] "food good"        "food great"       "food service"    
## [19] "go back"          "going back"       "good food"       
## [22] "good place"       "good service"     "great food"      
## [25] "great place"      "great service"    "happy hour"      
## [28] "highly recommend" "im sure"          "ive ever"        
## [31] "ive never"        "just right"       "last night"      
## [34] "last time"        "like place"       "little bit"      
## [37] "long time"        "love place"       "make sure"       
## [40] "much better"      "next time"        "one best"        
## [43] "one favorite"     "place go"         "place great"     
## [46] "pretty good"      "pretty much"      "really good"     
## [49] "really like"      "really nice"      "service good"    
## [52] "service great"    "staff friendly"   "tasted like"     
## [55] "wait staff"       "will back"        "will definitely"

Enhancing the Feature Set

We have some additional information about each restaurant in hygiene.dat.additional that we can use to enhance our feature set. We will extract features from this additional data for each restaurant and add them to our existing feature set, as sketched below.
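
A rough sketch of how these features might be appended follows. It assumes hygiene.dat.additional has already been parsed into a data frame hygiene.additional with columns cuisines, zip, reviews and rating; parsing of the bracketed cuisine list itself is omitted, and the variable names are assumptions:

# append the numeric and location features for the 546 training restaurants
hygiene.unigramDF <- cbind(hygiene.unigramDF,
                           zip     = hygiene.additional$zip[1:546],
                           reviews = hygiene.additional$reviews[1:546],
                           rating  = hygiene.additional$rating[1:546])
# expand the cuisine list into one indicator column per common cuisine,
# e.g. a "thai" column that is 1 when "Thai" appears in a restaurant's cuisines
for (cuisine in c("American", "Chinese", "Italian", "Japanese", "Thai")) {
  hygiene.unigramDF[[tolower(cuisine)]] <-
    as.integer(grepl(cuisine, hygiene.additional$cuisines[1:546]))
}

The same columns would be appended to hygiene.bigramDF and to the corresponding test data frames.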

After adding the additional features, we have 204 unigram features and 77 bigram features.

Unigram Feature set after adding additional features

colnames(hygiene.unigramDF)
##   [1] "actually"    "almost"      "also"        "always"      "amazing"    
##   [6] "amp"         "another"     "anything"    "area"        "around"     
##  [11] "atmosphere"  "away"        "awesome"     "back"        "bad"        
##  [16] "bar"         "best"        "better"      "big"         "bit"        
##  [21] "came"        "can"         "cant"        "cheese"      "chicken"    
##  [26] "come"        "couple"      "day"         "decent"      "definitely" 
##  [31] "delicious"   "didnt"       "different"   "dinner"      "dish"       
##  [36] "dishes"      "dont"        "drink"       "eat"         "eating"     
##  [41] "else"        "enough"      "especially"  "even"        "ever"       
##  [46] "every"       "everything"  "excellent"   "experience"  "favorite"   
##  [51] "feel"        "find"        "first"       "flavor"      "food"       
##  [56] "found"       "fresh"       "friend"      "friendly"    "friends"    
##  [61] "full"        "get"         "give"        "go"          "going"      
##  [66] "good"        "got"         "great"       "happy"       "hard"       
##  [71] "home"        "hot"         "however"     "huge"        "id"         
##  [76] "ill"         "im"          "isnt"        "ive"         "just"       
##  [81] "kind"        "know"        "large"       "last"        "least"      
##  [86] "like"        "little"      "long"        "looking"     "lot"        
##  [91] "love"        "lunch"       "made"        "make"        "makes"      
##  [96] "many"        "maybe"       "meal"        "meat"        "menu"       
## [101] "might"       "minutes"     "much"        "need"        "never"      
## [106] "new"         "next"        "nice"        "night"       "nothing"    
## [111] "now"         "oh"          "ok"          "one"         "order"      
## [116] "ordered"     "people"      "perfect"     "place"       "places"     
## [121] "pretty"      "price"       "prices"      "probably"    "quality"    
## [126] "quick"       "quite"       "really"      "recommend"   "restaurant" 
## [131] "right"       "said"        "salad"       "sauce"       "say"        
## [136] "seattle"     "see"         "served"      "service"     "side"       
## [141] "since"       "small"       "something"   "special"     "spot"       
## [146] "staff"       "stars"       "still"       "super"       "sure"       
## [151] "sweet"       "table"       "take"        "taste"       "tasty"      
## [156] "thats"       "thing"       "things"      "think"       "though"     
## [161] "thought"     "time"        "times"       "took"        "top"        
## [166] "tried"       "try"         "two"         "us"          "used"       
## [171] "usually"     "wait"        "want"        "wanted"      "wasnt"      
## [176] "way"         "well"        "went"        "will"        "without"    
## [181] "work"        "worth"       "years"       "youre"       "zip"        
## [186] "reviews"     "rating"      "american"    "bars"        "breakfast"  
## [191] "brunch"      "chinese"     "fast"        "food"        "italian"    
## [196] "japanese"    "mexican"     "new"         "pizza"       "sandwiches" 
## [201] "seafood"     "thai"        "traditional" "vietnamese"

Bigram Feature set after adding additional features

colnames(hygiene.bigramDF)
##  [1] "back try"         "best ive"         "can get"         
##  [4] "cant wait"        "come back"        "coming back"     
##  [7] "dont get"         "dont know"        "dont like"       
## [10] "dont think"       "even though"      "every time"      
## [13] "feel like"        "felt like"        "first time"      
## [16] "food good"        "food great"       "food service"    
## [19] "go back"          "going back"       "good food"       
## [22] "good place"       "good service"     "great food"      
## [25] "great place"      "great service"    "happy hour"      
## [28] "highly recommend" "im sure"          "ive ever"        
## [31] "ive never"        "just right"       "last night"      
## [34] "last time"        "like place"       "little bit"      
## [37] "long time"        "love place"       "make sure"       
## [40] "much better"      "next time"        "one best"        
## [43] "one favorite"     "place go"         "place great"     
## [46] "pretty good"      "pretty much"      "really good"     
## [49] "really like"      "really nice"      "service good"    
## [52] "service great"    "staff friendly"   "tasted like"     
## [55] "wait staff"       "will back"        "will definitely" 
## [58] "zip"              "reviews"          "rating"          
## [61] "american"         "bars"             "breakfast"       
## [64] "brunch"           "chinese"          "fast"            
## [67] "food"             "italian"          "japanese"        
## [70] "mexican"          "new"              "pizza"           
## [73] "sandwiches"       "seafood"          "thai"            
## [76] "traditional"      "vietnamese"

We can further refine the feature set by removing variables with near-zero variance.

library(caret)
# drop near-zero-variance predictors from the unigram features, applying the
# same column selection to the test data
nzv <- nearZeroVar(hygiene.unigramDF, saveMetrics = FALSE)
hygiene.unigramDF <- hygiene.unigramDF[,-nzv]
unigramtestData <- unigramtestData[,-nzv]
# repeat for the bigram features
nzv <- nearZeroVar(hygiene.bigramDF, saveMetrics = FALSE)
hygiene.bigramDF <- hygiene.bigramDF[,-nzv]
bigramtestData <- bigramtestData[,-nzv]

Check for correlation; if two variables are highly correlated, only one of them is needed.

# remove one predictor from each pair with pairwise correlation above 0.75
x <- cor(hygiene.unigramDF)
y <- findCorrelation(x, cutoff = .75)
hygiene.unigramDF <- hygiene.unigramDF[,-y]
unigramtestData <- unigramtestData[,-y]

After removing correlated variables, we have 154 unigram features.

Unigram Feature set after removing correlated variables

colnames(hygiene.unigramDF)
##   [1] "actually"    "almost"      "always"      "amazing"     "amp"        
##   [6] "another"     "anything"    "area"        "atmosphere"  "away"       
##  [11] "awesome"     "bad"         "bar"         "better"      "big"        
##  [16] "bit"         "cheese"      "chicken"     "come"        "couple"     
##  [21] "day"         "decent"      "different"   "dinner"      "dish"       
##  [26] "dishes"      "drink"       "eat"         "eating"      "else"       
##  [31] "enough"      "especially"  "every"       "everything"  "excellent"  
##  [36] "favorite"    "feel"        "find"        "flavor"      "found"      
##  [41] "fresh"       "friend"      "friendly"    "full"        "give"       
##  [46] "going"       "got"         "happy"       "hard"        "home"       
##  [51] "hot"         "however"     "huge"        "id"          "ill"        
##  [56] "isnt"        "kind"        "know"        "large"       "last"       
##  [61] "least"       "looking"     "lot"         "love"        "lunch"      
##  [66] "made"        "makes"       "many"        "maybe"       "meal"       
##  [71] "meat"        "menu"        "might"       "minutes"     "need"       
##  [76] "never"       "new"         "next"        "nice"        "night"      
##  [81] "nothing"     "now"         "oh"          "ok"          "order"      
##  [86] "perfect"     "places"      "pretty"      "price"       "prices"     
##  [91] "probably"    "quality"     "quick"       "quite"       "recommend"  
##  [96] "right"       "said"        "salad"       "sauce"       "say"        
## [101] "seattle"     "see"         "served"      "side"        "since"      
## [106] "small"       "something"   "special"     "spot"        "staff"      
## [111] "stars"       "super"       "sure"        "sweet"       "table"      
## [116] "take"        "taste"       "tasty"       "thing"       "things"     
## [121] "thought"     "times"       "took"        "top"         "tried"      
## [126] "two"         "used"        "usually"     "wait"        "want"       
## [131] "wanted"      "wasnt"       "way"         "without"     "work"       
## [136] "worth"       "years"       "youre"       "zip"         "rating"     
## [141] "american"    "brunch"      "chinese"     "food.1"      "italian"    
## [146] "japanese"    "mexican"     "new.1"       "pizza"       "sandwiches" 
## [151] "seafood"     "thai"        "traditional" "vietnamese"
x <- cor(hygiene.bigramDF)
y <- findCorrelation(x, cutoff = .75) 
hygiene.bigramDF <- hygiene.bigramDF[,-y]
bigramtestData <- bigramtestData[,-y]

After removing correlated variables, we have 74 bigram features.

Bigram Feature set after removing correlated variables

colnames(hygiene.bigramDF)
##  [1] "back try"         "best ive"         "can get"         
##  [4] "cant wait"        "come back"        "coming back"     
##  [7] "dont get"         "dont know"        "dont like"       
## [10] "dont think"       "even though"      "every time"      
## [13] "feel like"        "felt like"        "first time"      
## [16] "food good"        "food great"       "food service"    
## [19] "go back"          "going back"       "good food"       
## [22] "good place"       "good service"     "great food"      
## [25] "great place"      "great service"    "happy hour"      
## [28] "highly recommend" "im sure"          "ive ever"        
## [31] "ive never"        "just right"       "last night"      
## [34] "last time"        "like place"       "little bit"      
## [37] "long time"        "love place"       "make sure"       
## [40] "much better"      "next time"        "one best"        
## [43] "one favorite"     "place go"         "place great"     
## [46] "pretty good"      "pretty much"      "really good"     
## [49] "really like"      "really nice"      "service good"    
## [52] "service great"    "staff friendly"   "tasted like"     
## [55] "wait staff"       "will back"        "will definitely" 
## [58] "zip"              "reviews"          "rating"          
## [61] "american"         "breakfast"        "chinese"         
## [64] "food"             "italian"          "japanese"        
## [67] "mexican"          "new"              "pizza"           
## [70] "sandwiches"       "seafood"          "thai"            
## [73] "traditional"      "vietnamese"

Create a hold-out sample from the training data

Now we will partition the labeled training data into training and testing samples so that we can validate our classifier before applying it to the unlabeled test data. We use a 70:30 split: 70% of the labeled data is used to train the classifier, and the remaining 30% is held out to validate it.

hygiene.unigramDF <- cbind(hygiene.unigramDF, hygiene.labels)
colnames(hygiene.unigramDF)[ncol(hygiene.unigramDF)] <- "inspection"
set.seed(12354)
unigram.train.idx <- sample(nrow(hygiene.unigramDF), ceiling(nrow(hygiene.unigramDF) * 0.7))
unigram.test.idx <- (1:nrow(hygiene.unigramDF))[-unigram.train.idx]
unigram.training <- hygiene.unigramDF[unigram.train.idx,]
unigram.testing <- hygiene.unigramDF[unigram.test.idx,]
unigram.trainingLabels <- as.factor(unigram.training$inspection)
unigram.testingLabels <- as.factor(unigram.testing$inspection)
unigram.training <- unigram.training[,-which(colnames(unigram.training) == "inspection")]
unigram.testing <- unigram.testing[,-which(colnames(unigram.testing) == "inspection")]

hygiene.bigramDF <- cbind(hygiene.bigramDF, hygiene.labels)
colnames(hygiene.bigramDF)[ncol(hygiene.bigramDF)] <- "inspection"
bigram.train.idx <- sample(nrow(hygiene.bigramDF), ceiling(nrow(hygiene.bigramDF) * 0.7))
bigram.test.idx <- (1:nrow(hygiene.bigramDF))[-bigram.train.idx]
bigram.training <- hygiene.bigramDF[bigram.train.idx,]
bigram.testing <- hygiene.bigramDF[bigram.test.idx,]
bigram.trainingLabels <- as.factor(bigram.training$inspection)
bigram.testingLabels <- as.factor(bigram.testing$inspection)
bigram.training <- bigram.training[,-which(colnames(bigram.training) == "inspection")]
bigram.testing <- bigram.testing[,-which(colnames(bigram.testing) == "inspection")]

Modeling

We will run multiple machine learning models and use the one that gives us the best results. We will try k-nearest neighbors, random forest, and stochastic gradient boosting, using F1 as the metric to be maximized. The custom summary function below computes F1 from the resampled predictions.

library(ROCR)   # provides prediction() and performance() used below
F1 <- function(data, lev, model){
  # data$pred and data$obs hold the predicted and observed classes ("0"/"1");
  # convert them to numeric before handing them to ROCR
  pred <- prediction(as.numeric(as.character(data$pred)),
                     as.numeric(as.character(data$obs)))
  # performance(pred, "f") gives the F measure at each cutoff; the second
  # element is the value at the cutoff of 1, i.e. F1 with class "1" as positive
  out <- performance(pred, "f")@y.values[[1]][2]
  names(out) <- "F1"
  out
}
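
This summary function can then be wired into caret's resampling control. A minimal sketch, assuming 10-fold cross-validation as reported in the model summaries below (the remaining control settings are assumptions):

library(caret)
fitControl <- trainControl(method = "cv",          # 10-fold cross-validation
                           number = 10,
                           summaryFunction = F1)   # optimize the custom F1 metric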

k-Nearest Neighbors

In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors.
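
A possible caret call for fitting and validating the unigram knn model is sketched below. It reuses the fitControl object defined above, and the tuning grid mirrors the k values reported in the output, but the call itself is an assumption rather than the author's exact code:

unigram.knn.fit <- train(x = unigram.training, y = unigram.trainingLabels,
                         method = "knn", metric = "F1",
                         trControl = fitControl,
                         tuneGrid = data.frame(k = c(5, 7, 9)))
unigram.knn.pred <- predict(unigram.knn.fit, newdata = unigram.testing)
unigram.knn.confusionmatrix <- confusionMatrix(unigram.knn.pred, unigram.testingLabels)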

unigram knn model

unigram.knn.fit
## k-Nearest Neighbors 
## 
## 383 samples
## 153 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 345, 344, 344, 345, 345, 345, ... 
## Resampling results across tuning parameters:
## 
##   k  F1         F1 SD     
##   5  0.5890546  0.06779387
##   7  0.6250590  0.07753171
##   9  0.6357103  0.08532342
## 
## F1 was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.

confusion matrix for unigram knn model

unigram.knn.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 46 43
##          1 27 47
##                                           
##                Accuracy : 0.5706          
##                  95% CI : (0.4908, 0.6477)
##     No Information Rate : 0.5521          
##     P-Value [Acc > NIR] : 0.3478          
##                                           
##                   Kappa : 0.1493          
##  Mcnemar's Test P-Value : 0.0730          
##                                           
##             Sensitivity : 0.6301          
##             Specificity : 0.5222          
##          Pos Pred Value : 0.5169          
##          Neg Pred Value : 0.6351          
##              Prevalence : 0.4479          
##          Detection Rate : 0.2822          
##    Detection Prevalence : 0.5460          
##       Balanced Accuracy : 0.5762          
##                                           
##        'Positive' Class : 0               
## 

bigram knn model

bigram.knn.fit
## k-Nearest Neighbors 
## 
## 383 samples
##  73 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 344, 346, 345, 344, 345, 344, ... 
## Resampling results across tuning parameters:
## 
##   k  F1         F1 SD     
##   5  0.5453647  0.11753720
##   7  0.5211313  0.11848540
##   9  0.5363748  0.09093464
## 
## F1 was used to select the optimal model using  the largest value.
## The final value used for the model was k = 5.

confusion matrix for bigram knn model

bigram.knn.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 51 26
##          1 50 36
##                                           
##                Accuracy : 0.5337          
##                  95% CI : (0.4541, 0.6121)
##     No Information Rate : 0.6196          
##     P-Value [Acc > NIR] : 0.989740        
##                                           
##                   Kappa : 0.0796          
##  Mcnemar's Test P-Value : 0.008333        
##                                           
##             Sensitivity : 0.5050          
##             Specificity : 0.5806          
##          Pos Pred Value : 0.6623          
##          Neg Pred Value : 0.4186          
##              Prevalence : 0.6196          
##          Detection Rate : 0.3129          
##    Detection Prevalence : 0.4724          
##       Balanced Accuracy : 0.5428          
##                                           
##        'Positive' Class : 0               
## 

Random Forest

Random forests are an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and “Random Forests” is their trademark. The method combines Breiman’s bagging idea with random selection of features, introduced independently by Ho and by Amit and Geman, to construct a collection of decision trees with controlled variance.
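
A corresponding sketch for the unigram random forest model, reusing the same fitControl; the mtry grid mirrors the values reported in the output below (values larger than the number of predictors are reset by randomForest with a warning), while the call itself is an assumption:

unigram.rf.fit <- train(x = unigram.training, y = unigram.trainingLabels,
                        method = "rf", metric = "F1",
                        trControl = fitControl,
                        tuneGrid = data.frame(mtry = c(10, 50, 100, 500, 1000)))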

unigram rf model

unigram.rf.fit
## Random Forest 
## 
## 383 samples
## 153 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 345, 344, 345, 345, 344, 345, ... 
## Resampling results across tuning parameters:
## 
##   mtry  F1         F1 SD     
##     10  0.6049382  0.07191913
##     50  0.5892803  0.06098618
##    100  0.6123985  0.07094478
##    500  0.5987488  0.06863563
##   1000  0.6070963  0.06390975
## 
## F1 was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 100.

confusion matrix for unigram rf model

unigram.rf.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 51 38
##          1 24 50
##                                           
##                Accuracy : 0.6196          
##                  95% CI : (0.5404, 0.6944)
##     No Information Rate : 0.5399          
##     P-Value [Acc > NIR] : 0.02422         
##                                           
##                   Kappa : 0.2448          
##  Mcnemar's Test P-Value : 0.09874         
##                                           
##             Sensitivity : 0.6800          
##             Specificity : 0.5682          
##          Pos Pred Value : 0.5730          
##          Neg Pred Value : 0.6757          
##              Prevalence : 0.4601          
##          Detection Rate : 0.3129          
##    Detection Prevalence : 0.5460          
##       Balanced Accuracy : 0.6241          
##                                           
##        'Positive' Class : 0               
## 

bigram rf model

bigram.rf.fit
## Random Forest 
## 
## 383 samples
##  73 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 344, 345, 345, 345, 344, 345, ... 
## Resampling results across tuning parameters:
## 
##   mtry  F1         F1 SD     
##     10  0.6537525  0.04721641
##     50  0.6615189  0.06886562
##    100  0.6377641  0.07695971
##    500  0.6206777  0.06910854
##   1000  0.6420762  0.06050119
## 
## F1 was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 50.

confusion matrix for bigram rf model

bigram.rf.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47 30
##          1 33 53
##                                           
##                Accuracy : 0.6135          
##                  95% CI : (0.5342, 0.6886)
##     No Information Rate : 0.5092          
##     P-Value [Acc > NIR] : 0.004721        
##                                           
##                   Kappa : 0.2262          
##  Mcnemar's Test P-Value : 0.801059        
##                                           
##             Sensitivity : 0.5875          
##             Specificity : 0.6386          
##          Pos Pred Value : 0.6104          
##          Neg Pred Value : 0.6163          
##              Prevalence : 0.4908          
##          Detection Rate : 0.2883          
##    Detection Prevalence : 0.4724          
##       Balanced Accuracy : 0.6130          
##                                           
##        'Positive' Class : 0               
## 

Stochastic Gradient Boosting

Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, like other boosting methods, and generalizes them by allowing optimization of an arbitrary differentiable loss function.
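
A possible sketch for the unigram gbm model, again reusing fitControl; the tuning grid mirrors the parameter combinations reported in the output below, while the call itself is an assumption:

gbmGrid <- expand.grid(interaction.depth = 2:5,
                       n.trees = seq(100, 500, by = 100),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
unigram.gbm.fit <- train(x = unigram.training, y = unigram.trainingLabels,
                         method = "gbm", metric = "F1",
                         trControl = fitControl,
                         tuneGrid = gbmGrid, verbose = FALSE)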

unigram gbm model

unigram.gbm.fit
## Stochastic Gradient Boosting 
## 
## 383 samples
## 153 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 344, 345, 346, 345, 345, 344, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  F1         F1 SD     
##   2                  100      0.6324806  0.06978027
##   2                  200      0.6293300  0.06769479
##   2                  300      0.6130057  0.06217683
##   2                  400      0.6063549  0.05824169
##   2                  500      0.5959974  0.07142145
##   3                  100      0.5896646  0.09144147
##   3                  200      0.5718785  0.07660035
##   3                  300      0.5584011  0.06444784
##   3                  400      0.5881368  0.07923596
##   3                  500      0.5655245  0.07952033
##   4                  100      0.5833909  0.05070745
##   4                  200      0.5728192  0.06939895
##   4                  300      0.5625158  0.06988523
##   4                  400      0.5693650  0.07504513
##   4                  500      0.5630118  0.07180698
##   5                  100      0.5996130  0.07325693
##   5                  200      0.5960680  0.03772509
##   5                  300      0.5791496  0.03712235
##   5                  400      0.5869561  0.04584553
##   5                  500      0.5673033  0.05597792
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## F1 was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

confusion matrix for unigram gbm model

unigram.gbm.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 57 32
##          1 22 52
##                                           
##                Accuracy : 0.6687          
##                  95% CI : (0.5908, 0.7404)
##     No Information Rate : 0.5153          
##     P-Value [Acc > NIR] : 5.237e-05       
##                                           
##                   Kappa : 0.3393          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.7215          
##             Specificity : 0.6190          
##          Pos Pred Value : 0.6404          
##          Neg Pred Value : 0.7027          
##              Prevalence : 0.4847          
##          Detection Rate : 0.3497          
##    Detection Prevalence : 0.5460          
##       Balanced Accuracy : 0.6703          
##                                           
##        'Positive' Class : 0               
## 

bigram gbm model

bigram.gbm.fit
## Stochastic Gradient Boosting 
## 
## 383 samples
##  73 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 344, 345, 344, 345, 345, 345, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  F1         F1 SD     
##   2                  100      0.5809324  0.12792297
##   2                  200      0.5883536  0.10505908
##   2                  300      0.5764462  0.11107246
##   2                  400      0.5588518  0.09939930
##   2                  500      0.5618966  0.09387904
##   3                  100      0.5901226  0.11674285
##   3                  200      0.6010791  0.09627211
##   3                  300      0.5910125  0.10108267
##   3                  400      0.5837870  0.07390445
##   3                  500      0.5754913  0.08684396
##   4                  100      0.6072137  0.10505361
##   4                  200      0.5651959  0.09119413
##   4                  300      0.5811336  0.06910575
##   4                  400      0.5423515  0.09052754
##   4                  500      0.5776810  0.09319436
##   5                  100      0.6200092  0.08823854
##   5                  200      0.5699468  0.08930689
##   5                  300      0.6131817  0.09525975
##   5                  400      0.6147405  0.08661185
##   5                  500      0.6163756  0.09030642
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## F1 was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.

confusion matrix for bigram gbm model

bigram.gbm.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 54 23
##          1 32 54
##                                           
##                Accuracy : 0.6626          
##                  95% CI : (0.5845, 0.7347)
##     No Information Rate : 0.5276          
##     P-Value [Acc > NIR] : 0.000328        
##                                           
##                   Kappa : 0.3272          
##  Mcnemar's Test P-Value : 0.280713        
##                                           
##             Sensitivity : 0.6279          
##             Specificity : 0.7013          
##          Pos Pred Value : 0.7013          
##          Neg Pred Value : 0.6279          
##              Prevalence : 0.5276          
##          Detection Rate : 0.3313          
##    Detection Prevalence : 0.4724          
##       Balanced Accuracy : 0.6646          
##                                           
##        'Positive' Class : 0               
## 

We select the random forest model, as it gives the best F1 measure: the bigram random forest achieves the highest cross-validated F1 (about 0.66 at mtry = 50) of all the models tried.