Homework 5

Some questions on this homework involve random sampling, so each student may obtain slightly different results. Because of this, I need to see both your code and the resulting output. For simplicity, and to ensure that your output matches your textual explanations and answers, I will accept copies of your output placed in Microsoft Word for this assignment. However, if you feel more comfortable doing so, you may alternatively submit an HTML document produced in R Markdown. Just be sure that the HTML document displays your output and that your written answers match the displayed output.

Part 1: Association Analysis

Use R Markdown and the TransFood excel file to answer all of the questions in Part 1 of this homework.

  1. Load the TransFood excel file into R as a dataset of transactions.
#import the read excel library
library(readxl)

#install.packages("arules")
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
TransFood <- read_excel("/Users/kamriefoster/Downloads/TransFood.xlsx")
TransFood <- as(as.matrix(TransFood), "transactions") 

Each row in the dataset represents items purchased by a particular customer during one visit to a store. What is the largest total number of items purchased by a customer during one visit (i.e., what is the largest number of 1’s in any single row)? (Hint: You can answer this question by looking at a summary of the dataset.)

summary(TransFood)
## transactions as itemMatrix in sparse format with
##  19076 rows (elements/itemsets/transactions) and
##  118 columns (items) and a density of 0.02230729 
## 
## most frequent items:
##   Bottled.WaterFood Slice.of.CheeseFood    Medium.DrinkFood     Small.DrinkFood 
##                3166                3072                2871                2769 
##   Slice.of.PeppFood             (Other) 
##                2354               35981 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
##  197 5675 5178 3253 2129 1293  655  351  178   95   42   14    8    7    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.632   4.000  15.000 
## 
## includes extended item information - examples:
##              labels
## 1    Add.CheeseFood
## 2          BeerFood
## 3 Bottled.WaterFood

The largest total number of items purchased by a customer during one visit was 15.
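As a quick numeric check of the same value, the arules size() function returns the number of items in each transaction, so the maximum transaction size can be computed directly (a minimal sketch):

#the largest number of items in any single transaction; should match the summary above (15)
max(size(TransFood))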

  2. Which item is most frequently bought by customers? Create a frequency plot to support your answer.
itemFrequencyPlot(TransFood, support = 0.1, cex.names = 0.8)

According to the frequency plot, the item most frequently bought by customers is Bottled.WaterFood.
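The top items can also be confirmed numerically with itemFrequency(); a minimal check:

#relative frequency of each item, sorted in decreasing order; the first entry is the most frequently purchased item
head(sort(itemFrequency(TransFood), decreasing = TRUE), 5)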

  3. Perform association analysis to determine rules with support of at least 1% and confidence of at least 50%.
## Run the apriori algorithm

#creating the association rules
basket_rules <- apriori(TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules")) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 190 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.00s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# find all rules with a minimum support of 1% and a minimum confidence of 50%; maxlen is set to 10 by default (meaning a rule can contain no more than 10 items)
summary(basket_rules)
## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.375   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01012   Min.   :0.5152   Min.   :0.01735   Min.   : 3.199  
##  1st Qu.:0.01290   1st Qu.:0.5516   1st Qu.:0.02115   1st Qu.: 3.534  
##  Median :0.01594   Median :0.6398   Median :0.02432   Median : 7.687  
##  Mean   :0.01641   Mean   :0.6790   Mean   :0.02387   Mean   : 6.965  
##  3rd Qu.:0.01776   3rd Qu.:0.7672   3rd Qu.:0.02696   3rd Qu.: 8.331  
##  Max.   :0.02857   Max.   :0.9982   Max.   :0.02862   Max.   :13.181  
##      count      
##  Min.   :193.0  
##  1st Qu.:246.0  
##  Median :304.0  
##  Mean   :313.1  
##  3rd Qu.:338.8  
##  Max.   :545.0  
## 
## mining info:
##       data ntransactions support confidence
##  TransFood         19076    0.01        0.5
##                                                                                   call
##  apriori(data = TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
  a. How many association rules can you find that have support of at least 1% and confidence of at least 50%?

There are 8 association rules that have a support of at least 1% and confidence of at least 50%.

  b. What is the smallest value of lift obtained for the association rules calculated in this question?

The minimum lift value obtained for the association rules calculated is 3.199.

  c. Based on the values of lift for each rule, are the items in each rule frequently purchased together? Or are the items seldom purchased together?

Since all lift values are greater than 1, the rules indicate that the items in each rule tend to be bought together: if a customer purchases the item(s) on the left-hand side of a rule, they are more likely to also purchase the item(s) on the right-hand side.
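As a quick check, the rules can also be listed in order of lift using the arules sort() method (a minimal sketch):

#list the rules ordered by lift, strongest associations first
inspect(sort(basket_rules, by = "lift"))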

  d. What is the average confidence obtained for the association rules calculated in this question?

The average (mean) confidence for the association rules calculated in this question is 0.6790.

  e. What is the median support obtained for the association rules calculated in this question?

The median support obtained for the association rules calculated in this question is 0.01594.
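Both of these summary values can also be computed directly from the rules' quality measures using the arules quality() accessor (a minimal check):

#quality() returns the support/confidence/lift table for the rules
mean(quality(basket_rules)$confidence)   #average confidence (about 0.679)
median(quality(basket_rules)$support)    #median support (about 0.0159)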

  f. Create a plot to visualize the support, confidence, and lift of the rules calculated in this question.
#install.packages('arulesViz')
library('arulesViz')
plot(basket_rules)
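If a different view is helpful, arulesViz also supports other plot methods; for example, a graph layout of the same rules:

#alternative visualization: rules shown as a graph of items and rule nodes
plot(basket_rules, method = "graph")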

  g. Choose two rules, and explain how the store could use these rules to improve sales.
# Check the generated rules using inspect
inspect(basket_rules)
##     lhs                                    rhs                             support confidence   coverage      lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266  0.5596869 0.02678759 13.180972   286
## [2] {ToppingFood}                       => {Ice.Cream.ConeFood}         0.02856993  0.9981685 0.02862235  8.947868   545
## [3] {Add.CheeseFood}                    => {Soft.Pretzel..3_39Food}     0.01913399  0.6965649 0.02746907  7.601643   365
## [4] {Chicken.TendersFood}               => {French.Fries.BasketFood}    0.01729922  0.7586207 0.02280352  7.771992   330
## [5] {CheeseburgerFood}                  => {French.Fries.BasketFood}    0.01687985  0.7931034 0.02128329  8.125264   322
## [6] {GatoradeFood,                                                                                                      
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01011743  0.5830816 0.01735165  3.620724   193
## [7] {Medium.DrinkFood,                                                                                                  
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01362969  0.5273834 0.02584399  3.274858   260
## [8] {Bottled.WaterFood,                                                                                                 
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01069407  0.5151515 0.02075907  3.198903   204

Rule 5: CheeseburgerFood => French.Fries.BasketFood
This rule indicates that people who buy a cheeseburger are more likely to also buy a basket of fries.

Rule 4: Chicken.TendersFood => French.Fries.BasketFood
This rule indicates that people who buy chicken tenders are more likely to also buy a basket of fries.

Knowing this, the store could sell these items together as a combo deal rather than only selling them separately, much as McDonald's and other fast-food restaurants offer a meal option that bundles the two items for one price. Because these combinations occur so often, a bundle adds convenience for customers: with a meal option, people will likely feel more inclined to buy the burger or tenders with fries rather than without. The business could also price the meal slightly higher than the two items combined with little pushback, since few customers will take the time to do that calculation. The store would then not only sell more of each product but potentially sell them at a higher price, increasing sales and profit.

Part 2: Bagging and Random Forests

Use R Markdown and the Titanic passenger list excel file to answer all of the questions in Part 2 of this homework. This dataset contains information about all passengers (excluding crew members) on the Titanic.

  1. Load the Titanic passenger list excel file into R, convert the dataset into a data frame, and convert all non-numeric columns of data into factors.
Titanic <- read_excel("/Users/kamriefoster/Downloads/Titanic passenger list.xlsx")
Titanic <- as.data.frame(Titanic)

Titanic$Survived <- as.factor(Titanic$Survived)
Titanic$Sex <- as.factor(Titanic$Sex)
Titanic$Embarked <- as.factor(Titanic$Embarked)
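Because the exact set of text columns in the file may vary, a more general approach (a sketch, not required) is to convert every remaining character column to a factor in one pass:

#convert any remaining character columns to factors; columns that are already factors or numeric are left unchanged
Titanic[] <- lapply(Titanic, function(col) if (is.character(col)) as.factor(col) else col)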
  2. Check for missing values in the dataset.
sum(is.na(Titanic))
## [1] 3

There are 3 missing values in the dataset that should be handled.
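To see which columns contain those missing values, a per-column count is helpful:

#count of missing values in each column
colSums(is.na(Titanic))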

  3. Create a table and/or graph displaying the number of passengers who embarked from each location (i.e., the number of passengers who embarked from Cherbourg, the number who embarked from Queenstown, etc.). From which location did the largest number of passengers embark?
table(Titanic$Embarked)
## 
##    Cherbourg   Queenstown Southhampton 
##          270          123          914
#alternatively, find the most common level with a simple custom mode function (base R has no built-in statistical mode function)
# Create the function.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

result <- getmode(Titanic$Embarked)
print(result)
## [1] Southhampton
## Levels: Cherbourg Queenstown Southhampton

According to the table above, the largest number of passengers embarked from Southampton (recorded as "Southhampton" in the dataset).
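A simple bar chart of the same counts would also satisfy the "table and/or graph" requirement; for example:

#bar chart of passenger counts by port of embarkation
barplot(table(Titanic$Embarked), main = "Passengers by Port of Embarkation", ylab = "Passengers")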

Impute all missing values in the Embarked column with this location.

# Impute missing values by the most common location

Titanic$Embarked[is.na(Titanic$Embarked)] <- getmode(Titanic$Embarked)
sum(is.na(Titanic$Embarked))
## [1] 0
  4. For any missing value(s) in a numeric column(s) of data, impute the average of the column.
# Impute missing values in each numeric column with the column mean (the question asks for the average)

Titanic$Parent_or_Child[is.na(Titanic$Parent_or_Child)] <- mean(Titanic$Parent_or_Child, na.rm = T)

Titanic$Sibling_or_Spouse[is.na(Titanic$Sibling_or_Spouse)] <- mean(Titanic$Sibling_or_Spouse, na.rm = T)

Titanic$Fare[is.na(Titanic$Fare)] <- mean(Titanic$Fare, na.rm = T)
sum(is.na(Titanic))
## [1] 0
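Rather than naming each column, a short loop over the numeric columns would impute the mean wherever missing values remain; a minimal sketch:

#impute the column mean for every numeric column that still has NAs
for (col in names(Titanic)) {
  if (is.numeric(Titanic[[col]]) && anyNA(Titanic[[col]])) {
    Titanic[[col]][is.na(Titanic[[col]])] <- mean(Titanic[[col]], na.rm = TRUE)
  }
}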
  5. Randomly sample the rows of your data to include 90% of the rows in your training set. Use the rest of the rows as your testing set.
index <- sample(nrow(Titanic), nrow(Titanic)*0.90)
Titanic_train = Titanic[index,]
Titanic_test = Titanic[-index,]
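Because the split is random, the results below will differ between runs; calling set.seed() before sampling (the seed value in this sketch is arbitrary) would make the split reproducible:

#optional: fix the random seed so the 90/10 split is reproducible
set.seed(123)
index <- sample(nrow(Titanic), nrow(Titanic) * 0.90)
Titanic_train <- Titanic[index, ]
Titanic_test  <- Titanic[-index, ]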
  6. Using 300 bootstrapped sets, develop a bagging model on your training set to predict whether or not a passenger survived the sinking of the Titanic.
### Create the bagging model by using the following library ###

#install.packages('ipred')
library(ipred)

Developing the model

Titanic_bag <- bagging(formula = Survived~., data = Titanic_train, nbagg = 300) 
#300 bootstrap samples; a classification tree is fit to each of the 300 bootstrap samples, and the final prediction is a majority vote across the 300 trees
#nbagg sets the number of bootstrap samples
Titanic_bag
## 
## Bagging classification trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = Titanic_train, 
##     nbagg = 300)
  7. What is the out-of-bag error for your model?
Titanic_bag_oob <- bagging(formula = Survived~.,
                          data = Titanic_train,
                          coob = T,
                          nbagg = 300) #coob = T automatically calculates the out-of-bag estimate of the misclassification error
Titanic_bag_oob
## 
## Bagging classification trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = Titanic_train, 
##     coob = T, nbagg = 300)
## 
## Out-of-bag estimate of misclassification error:  0.208

The out-of-bag estimate of the misclassification error was found to be about 0.208.

  8. Use your model to make predictions for the observations in your testing set.
Titanic_bag_pred <- predict(Titanic_bag, newdata = Titanic_test)
Titanic_bag_pred
##   [1] Died  Lived Lived Lived Lived Lived Lived Died  Died  Lived Died  Lived
##  [13] Lived Lived Lived Died  Lived Lived Lived Lived Lived Lived Lived Lived
##  [25] Lived Lived Lived Lived Lived Lived Died  Died  Died  Died  Lived Lived
##  [37] Died  Died  Lived Died  Died  Died  Died  Died  Lived Lived Died  Died 
##  [49] Died  Lived Died  Died  Lived Lived Lived Lived Lived Died  Lived Lived
##  [61] Lived Died  Lived Died  Lived Died  Died  Died  Died  Lived Died  Lived
##  [73] Died  Died  Died  Died  Died  Lived Lived Died  Died  Died  Died  Lived
##  [85] Died  Died  Died  Died  Died  Lived Lived Died  Died  Died  Died  Died 
##  [97] Died  Lived Lived Died  Died  Died  Died  Lived Died  Died  Died  Lived
## [109] Died  Died  Lived Died  Died  Died  Died  Died  Died  Lived Lived Died 
## [121] Died  Died  Died  Died  Died  Died  Died  Died  Died  Died  Died 
## Levels: Died Lived
table(Titanic_bag_pred)
## Titanic_bag_pred
##  Died Lived 
##    76    55
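To see how well the bagging model does on the held-out data, the predictions can be compared with the actual outcomes (a sketch; the exact numbers depend on the random split):

#confusion matrix and accuracy of the bagging model on the testing set
table(Predicted = Titanic_bag_pred, Actual = Titanic_test$Survived)
mean(Titanic_bag_pred == Titanic_test$Survived)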
  9. Using 300 trees, develop a random forest model on your training set to predict whether or not a passenger survived the sinking of the Titanic.
#packages and libraries needed

#install.packages('randomForest')
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
Titanic_rf <- randomForest(as.factor(Survived)~.,
                          data = Titanic_train,
                          importance = TRUE, ntree = 300)

Titanic_rf
## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ ., data = Titanic_train,      importance = TRUE, ntree = 300) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.95%
## Confusion matrix:
##       Died Lived class.error
## Died   661    65  0.08953168
## Lived  170   282  0.37610619
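Since importance = TRUE was set when fitting the forest, variable importance can also be examined (not required by the question, just a quick look):

#which variables the random forest relies on most
importance(Titanic_rf)
varImpPlot(Titanic_rf)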
  10. What is the out-of-bag error for your model?

The out-of-bag error for this model is about 0.1995 (the 19.95% OOB estimate of the error rate shown above).
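The same value can be pulled programmatically from the fitted model; the err.rate matrix stores one row per tree, so the last row corresponds to the full 300-tree forest:

#final out-of-bag error rate (row 300 of the error-rate matrix)
Titanic_rf$err.rate[300, "OOB"]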

  11. Based on the out-of-bag error, is the random forest better at predicting survival than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy?

Based on the out-of-bag errors, the random forest performs slightly better than the bagging model (about 0.1995 versus 0.208). However, the two models return approximately the same out-of-bag error, so it is fairer to say that they perform with about the same accuracy; the exact values also depend on the random training/testing split and bootstrap samples, which change each time the code is run.

  12. What is the false positive rate (for your random forest model developed with 300 trees)?

Treating "Lived" as the positive class, the false positive rate is 65 / (661 + 65) ≈ 0.0895, the class error rate for the Died class in the confusion matrix above.

  13. What is the false negative rate (for your random forest model developed with 300 trees)?

The false negative rate is 170 / (170 + 282) ≈ 0.3761, the class error rate for the Lived class in the confusion matrix above.
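Both rates can be computed directly from the out-of-bag confusion matrix stored in the model, treating "Lived" as the positive class; a minimal sketch:

#false positive and false negative rates from the OOB confusion matrix
cm <- Titanic_rf$confusion[, c("Died", "Lived")]
fpr <- cm["Died", "Lived"] / sum(cm["Died", ])   #actual Died predicted Lived
fnr <- cm["Lived", "Died"] / sum(cm["Lived", ])  #actual Lived predicted Died
c(FPR = fpr, FNR = fnr)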

  14. Create a plot comparing the out-of-bag error, the false positive rate, the false negative rate, and the number of trees. Based on your plot, does it appear that a large number of trees are needed to develop a fairly accurate random forest model?
#plotting the random forest we made where line width is 2 for all three lines.
plot(Titanic_rf, lwd=rep(2, 3))

#legend is at the right, legend has width of 2
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))

  15. Use your random forest model to make predictions for the observations in your testing set.
Titanic_rf_pred <- predict(Titanic_rf, newdata = Titanic_test)

Titanic_rf_pred
##     6    12    13    14    18    23    50    59    61    62    97   102   112 
##  Died Lived Lived Lived Lived  Died  Died  Died  Died Lived  Died  Died Lived 
##   122   127   133   134   147   150   170   177   207   211   231   239   254 
## Lived  Died  Died  Died Lived Lived Lived Lived Lived  Died Lived Lived Lived 
##   274   279   283   290   314   318   319   352   362   370   386   390   402 
##  Died  Died Lived Lived  Died  Died  Died  Died Lived Lived  Died  Died Lived 
##   405   422   423   424   449   476   483   492   493   498   503   528   533 
##  Died  Died  Died  Died  Died Lived Lived  Died  Died  Died Lived  Died  Died 
##   535   537   549   552   558   575   587   600   602   630   661   666   668 
## Lived Lived Lived Lived Lived  Died Lived Lived  Died  Died Lived  Died Lived 
##   670   674   678   686   693   694   703   712   714   719   732   750   765 
##  Died  Died  Died  Died Lived  Died Lived  Died  Died  Died  Died  Died  Died 
##   766   770   774   775   776   781   798   813   815   817   827   841   873 
## Lived  Died  Died  Died  Died  Died  Died  Died  Died  Died  Died  Died  Died 
##   903   908   921   922   931   932   940   947   952   958   975  1011  1015 
##  Died  Died  Died  Died  Died  Died Lived Lived  Died  Died  Died  Died  Died 
##  1021  1028  1031  1041  1044  1045  1051  1065  1072  1081  1086  1096  1107 
##  Died  Died  Died Lived  Died Lived Lived  Died  Died  Died  Died Lived  Died 
##  1114  1123  1143  1158  1171  1194  1205  1209  1210  1252  1258  1271  1283 
## Lived Lived  Died  Died  Died  Died  Died  Died  Died  Died  Died  Died  Died 
##  1299 
##  Died 
## Levels: Died Lived
table(Titanic_rf_pred)
## Titanic_rf_pred
##  Died Lived 
##    88    43
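As with the bagging model, comparing these predictions with the actual outcomes gives a test-set confusion matrix and accuracy (numbers will vary with the random split):

#confusion matrix and accuracy of the random forest on the testing set
table(Predicted = Titanic_rf_pred, Actual = Titanic_test$Survived)
mean(Titanic_rf_pred == Titanic_test$Survived)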