Week 5 Homework Part 1: Association Analysis

Use R Markdown and the TransFood excel file to answer all of the questions in Part 1 of this homework.

Load the TransFood excel file into R as a dataset of transactions. Each row in the dataset represents items purchased by a particular customer during one visit to a store.

library(readxl)   # for read_xlsx()
library(arules)   # for the transactions class, apriori(), and related functions

tf <- read_xlsx("C:/Users/justt/Desktop/School/621/Assignment/Homework 5/TransFood.xlsx")

What is the largest total number of items purchased by a customer during one visit (i.e., what is the largest number of 1’s in any single row)? (Hint: You can answer this question by looking at a summary of the dataset.)

tf <- as(as.matrix(tf), "transactions")
summary(tf)
## transactions as itemMatrix in sparse format with
##  19076 rows (elements/itemsets/transactions) and
##  118 columns (items) and a density of 0.02230729 
## 
## most frequent items:
##   Bottled.WaterFood Slice.of.CheeseFood    Medium.DrinkFood     Small.DrinkFood 
##                3166                3072                2871                2769 
##   Slice.of.PeppFood             (Other) 
##                2354               35981 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
##  197 5675 5178 3253 2129 1293  655  351  178   95   42   14    8    7    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.632   4.000  15.000 
## 
## includes extended item information - examples:
##              labels
## 1    Add.CheeseFood
## 2          BeerFood
## 3 Bottled.WaterFood

The largest total number of items purchased by a customer in one visit is 15; per the transaction length distribution in the summary, this occurred only once.
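This can also be checked directly from the transactions object; a minimal sketch, assuming the arules package is loaded (size() returns the number of items in each transaction):

max(size(tf))         # largest number of items in a single transaction
sum(size(tf) == 15)   # how many transactions reach that maximum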

  1. Which item is most frequently bought by customers? Create a frequency plot to support your answer.
summary(tf)
## transactions as itemMatrix in sparse format with
##  19076 rows (elements/itemsets/transactions) and
##  118 columns (items) and a density of 0.02230729 
## 
## most frequent items:
##   Bottled.WaterFood Slice.of.CheeseFood    Medium.DrinkFood     Small.DrinkFood 
##                3166                3072                2871                2769 
##   Slice.of.PeppFood             (Other) 
##                2354               35981 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
##  197 5675 5178 3253 2129 1293  655  351  178   95   42   14    8    7    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.632   4.000  15.000 
## 
## includes extended item information - examples:
##              labels
## 1    Add.CheeseFood
## 2          BeerFood
## 3 Bottled.WaterFood

The most frequently purchased item, per the summary, was Bottled.WaterFood, which appeared in 3,166 transactions.

itemFrequencyPlot(tf, support = 0.1, cex.names = 0.8)

The frequency plot confirms that Bottled.WaterFood is the most frequently purchased item, appearing in 3,166 of the 19,076 transactions (a support of roughly 16.6%).
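If exact counts are needed rather than the plot, a quick sketch (again assuming arules is loaded) pulls the top absolute item frequencies directly:

sort(itemFrequency(tf, type = "absolute"), decreasing = TRUE)[1:5]   # top 5 item counts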

Perform association analysis to determine rules with support of at least 1% and confidence of at least 50%. The resulting rules are listed below.

tfbasket <- apriori(tf, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 190 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.01s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(tfbasket)
##     lhs                                    rhs                             support confidence   coverage      lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266  0.5596869 0.02678759 13.180972   286
## [2] {ToppingFood}                       => {Ice.Cream.ConeFood}         0.02856993  0.9981685 0.02862235  8.947868   545
## [3] {Add.CheeseFood}                    => {Soft.Pretzel..3_39Food}     0.01913399  0.6965649 0.02746907  7.601643   365
## [4] {Chicken.TendersFood}               => {French.Fries.BasketFood}    0.01729922  0.7586207 0.02280352  7.771992   330
## [5] {CheeseburgerFood}                  => {French.Fries.BasketFood}    0.01687985  0.7931034 0.02128329  8.125264   322
## [6] {GatoradeFood,                                                                                                      
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01011743  0.5830816 0.01735165  3.620724   193
## [7] {Medium.DrinkFood,                                                                                                  
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01362969  0.5273834 0.02584399  3.274858   260
## [8] {Bottled.WaterFood,                                                                                                 
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01069407  0.5151515 0.02075907  3.198903   204
  1. How many association rules can you find that have support of at least 1% and confidence of at least 50%?
tfbasket
## set of 8 rules

There are 8 rules where the support is at least 1% and confidence is at least 50%.

  1. What is the smallest value of lift obtained for the association rules calculated in this question?
inspect(subset(tfbasket, min(lift)))
##     lhs                 rhs                      support    confidence
## [1] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649 
##     coverage   lift     count
## [1] 0.02746907 7.601643 365

The subset() call above does not actually filter on the minimum lift: min(lift) evaluates to the single number 3.198903, which then gets used as a rule index (truncated to 3), so inspect() simply returns the third rule and its lift of 7.601643. Looking at the rule set itself (or at the lift column in summary(tfbasket)), the smallest lift is 3.198903, obtained for the rule {Bottled.WaterFood, Slice.of.PeppFood} => {Slice.of.CheeseFood}.
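A sketch of a more direct way to get the minimum lift and the rule that achieves it, using the quality() accessor from arules:

min(quality(tfbasket)$lift)                            # smallest lift value: 3.198903
inspect(tfbasket[which.min(quality(tfbasket)$lift)])   # the rule with that lift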

  1. Based on the values of lift for each rule, are the items in each rule frequently purchased together? Or are the items seldom purchased together?
inspect(subset(tfbasket, lift > 1))
##     lhs                                    rhs                             support confidence   coverage      lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266  0.5596869 0.02678759 13.180972   286
## [2] {ToppingFood}                       => {Ice.Cream.ConeFood}         0.02856993  0.9981685 0.02862235  8.947868   545
## [3] {Add.CheeseFood}                    => {Soft.Pretzel..3_39Food}     0.01913399  0.6965649 0.02746907  7.601643   365
## [4] {Chicken.TendersFood}               => {French.Fries.BasketFood}    0.01729922  0.7586207 0.02280352  7.771992   330
## [5] {CheeseburgerFood}                  => {French.Fries.BasketFood}    0.01687985  0.7931034 0.02128329  8.125264   322
## [6] {GatoradeFood,                                                                                                      
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01011743  0.5830816 0.01735165  3.620724   193
## [7] {Medium.DrinkFood,                                                                                                  
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01362969  0.5273834 0.02584399  3.274858   260
## [8] {Bottled.WaterFood,                                                                                                 
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01069407  0.5151515 0.02075907  3.198903   204

The lift of each of the 8 rules is well above 1, meaning the items in each rule are purchased together considerably more often than would be expected if they were bought independently; in other words, these items are frequently purchased together.

  1. What is the average confidence obtained for the association rules calculated in this question?
summary(tfbasket)
## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.375   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01012   Min.   :0.5152   Min.   :0.01735   Min.   : 3.199  
##  1st Qu.:0.01290   1st Qu.:0.5516   1st Qu.:0.02115   1st Qu.: 3.534  
##  Median :0.01594   Median :0.6398   Median :0.02432   Median : 7.687  
##  Mean   :0.01641   Mean   :0.6790   Mean   :0.02387   Mean   : 6.965  
##  3rd Qu.:0.01776   3rd Qu.:0.7672   3rd Qu.:0.02696   3rd Qu.: 8.331  
##  Max.   :0.02857   Max.   :0.9982   Max.   :0.02862   Max.   :13.181  
##      count      
##  Min.   :193.0  
##  1st Qu.:246.0  
##  Median :304.0  
##  Mean   :313.1  
##  3rd Qu.:338.8  
##  Max.   :545.0  
## 
## mining info:
##  data ntransactions support confidence
##    tf         19076    0.01        0.5
##                                                                            call
##  apriori(data = tf, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))

By looking at a summary of the tfbasket rules, we can see that the average (mean) confidence is 0.6790.

  1. What is the median support obtained for the association rules calculated in this question?
summary(tfbasket)
## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.375   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01012   Min.   :0.5152   Min.   :0.01735   Min.   : 3.199  
##  1st Qu.:0.01290   1st Qu.:0.5516   1st Qu.:0.02115   1st Qu.: 3.534  
##  Median :0.01594   Median :0.6398   Median :0.02432   Median : 7.687  
##  Mean   :0.01641   Mean   :0.6790   Mean   :0.02387   Mean   : 6.965  
##  3rd Qu.:0.01776   3rd Qu.:0.7672   3rd Qu.:0.02696   3rd Qu.: 8.331  
##  Max.   :0.02857   Max.   :0.9982   Max.   :0.02862   Max.   :13.181  
##      count      
##  Min.   :193.0  
##  1st Qu.:246.0  
##  Median :304.0  
##  Mean   :313.1  
##  3rd Qu.:338.8  
##  Max.   :545.0  
## 
## mining info:
##  data ntransactions support confidence
##    tf         19076    0.01        0.5
##                                                                            call
##  apriori(data = tf, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))

By looking at a summary of the tfbasket rules, we can see that the median support is 0.01594.
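Both of these values can also be computed directly from the rule quality measures; a brief sketch:

mean(quality(tfbasket)$confidence)   # average confidence, about 0.679
median(quality(tfbasket)$support)    # median support, about 0.0159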

  1. Create a plot to visualize the support, confidence, and lift of the rules calculated in this question.
library('arulesViz')
plot(tfbasket)

The plot above visualizes the support, confidence, and lift of the 8 rules (support and confidence on the axes, with lift shown as the point shading).
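As an optional alternative view (a sketch, assuming the installed arulesViz version supports it), a graph layout can make the relationships among items in a small rule set easier to read:

plot(tfbasket, method = "graph")   # items as nodes, rules as connecting edges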

  1. Choose two rules, and explain how the store could use these rules to improve sales.

I chose rules 4 and 5:

Rule 4: When Chicken.TendersFood is purchased, French.Fries.BasketFood is usually purchased at the same time (confidence of about 0.76).
Rule 5: When CheeseburgerFood is purchased, French.Fries.BasketFood is usually purchased at the same time (confidence of about 0.79).

Assumption: French.Fries.BasketFood refers to French Fries, CheeseburgerFood to Cheeseburgers, and Chicken.TendersFood to Chicken Tenders.

Based on these two rules, I would recommend that the store place the French Fries near both the Chicken Tenders and the Cheeseburgers, so that customers buying either item are prompted to add fries. If co-locating the items isn't possible, a sign could be placed on the freezer doors for the Chicken Tenders and the Cheeseburgers reminding shoppers to purchase French Fries and indicating which aisle they are in.

Week 5 Homework Part 2: Bagging and Random Forests

Use R Markdown and the Titanic passenger list excel file to answer all of the questions in Part 2 of this homework. This dataset contains information about all passengers (excluding crew members) on the Titanic.

Load the Titanic passenger list excel file into R, convert the dataset into a data frame, and convert all non-numeric columns of data into factors.

titanic <- read_xlsx("C:/Users/justt/Desktop/School/621/Assignment/Homework 5/Titanic passenger list.xlsx")
titanic <- as.data.frame(unclass(titanic),stringsAsFactors=TRUE)
str(titanic)
## 'data.frame':    1309 obs. of  7 variables:
##  $ Passenger_Class  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Survived         : Factor w/ 2 levels "Died","Lived": 2 2 1 1 1 2 2 1 2 1 ...
##  $ Sex              : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ Sibling_or_Spouse: num  0 1 1 1 1 0 1 0 2 0 ...
##  $ Parent_or_Child  : num  0 2 2 2 2 0 0 0 0 0 ...
##  $ Fare             : num  211 152 152 152 152 ...
##  $ Embarked         : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 3 3 3 3 3 3 3 3 1 ...
  1. Check for missing values in the dataset.
sum(is.na(titanic))
## [1] 3

Yes, there are 3 missing values in the data set.
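To see which columns those missing values fall in, a quick sketch:

colSums(is.na(titanic))   # count of NAs per column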

  1. Create a table and/or graph displaying the number of passengers who embarked from each location (i.e., the number of passengers who embarked from Cherbourg, the number who embarked from Queenstown, etc.).
summary(titanic$Embarked)
##    Cherbourg   Queenstown Southhampton         NA's 
##          270          123          914            2

There were 270 passengers who embarked from Cherbourg, 123 from Queenstown, and 914 from Southhampton (as the location is spelled in the data); the remaining 2 values are missing.

From which location did the largest number of passengers embark?

The largest number of passengers embarked from Southhampton.

Impute all missing values in the Embarked column with this location.

titanic$Embarked[is.na(titanic$Embarked)==T] <- "Southhampton"
summary(titanic$Embarked)
##    Cherbourg   Queenstown Southhampton 
##          270          123          916
  1. For any missing value(s) in a numeric column(s) of data, impute the average of the column.
titanic$Fare[is.na(titanic$Fare)==T] <- mean(titanic$Fare, na.rm = T)
str(titanic$Fare)
##  num [1:1309] 211 152 152 152 152 ...
summary(titanic$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.896  14.454  33.295  31.275 512.329
sum(is.na(titanic))
## [1] 0

Randomly sample the rows of your data to include 90% of the rows in your training set. Use the rest of the rows as your testing set.

ind <- sample(nrow(titanic), nrow(titanic)*0.9) 
titanic_train <- titanic[ind,]
str(titanic_train)
## 'data.frame':    1178 obs. of  7 variables:
##  $ Passenger_Class  : num  1 1 2 1 3 3 1 3 2 3 ...
##  $ Survived         : Factor w/ 2 levels "Died","Lived": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Sex              : Factor w/ 2 levels "female","male": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Sibling_or_Spouse: num  1 2 1 0 4 0 0 1 1 0 ...
##  $ Parent_or_Child  : num  0 2 1 0 2 0 2 1 1 0 ...
##  $ Fare             : num  55.4 262.4 29 31 31.4 ...
##  $ Embarked         : Factor w/ 3 levels "Cherbourg","Queenstown",..: 1 1 3 1 3 2 3 1 3 3 ...
titanic_test <- titanic[-ind,]
str(titanic_test)
## 'data.frame':    131 obs. of  7 variables:
##  $ Passenger_Class  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Survived         : Factor w/ 2 levels "Died","Lived": 2 2 2 1 1 2 2 1 1 2 ...
##  $ Sex              : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
##  $ Sibling_or_Spouse: num  0 0 0 1 0 1 1 0 0 3 ...
##  $ Parent_or_Child  : num  0 0 1 0 0 1 0 0 0 2 ...
##  $ Fare             : num  26.6 26.6 55 136.8 25.6 ...
##  $ Embarked         : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 3 3 1 3 3 1 1 1 3 ...
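Because sample() is called without a fixed seed, the split (and every count that depends on it) changes each time the document is knit, which is why some of the numbers below differ from the screenshots in the submitted Word document. A minimal sketch of a reproducible split, assuming an arbitrary seed value of 123:

set.seed(123)                                      # arbitrary seed for reproducibility
ind <- sample(nrow(titanic), nrow(titanic) * 0.9)  # indices for the 90% training split
titanic_train <- titanic[ind, ]
titanic_test  <- titanic[-ind, ]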

Using 300 bootstrapped sets, develop a bagging model on your training set to predict whether or not a passenger survived the sinking of the Titanic.

library(ipred)   # for bagging()

titanic_bag <- bagging(formula = Survived~., data = titanic_train, nbagg = 300)
titanic_bag
## 
## Bagging classification trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = titanic_train, 
##     nbagg = 300)
titanic_bag_pred <- predict(titanic_bag, newdata = titanic_train)
summary(titanic_bag_pred)
##  Died Lived 
##   747   431

On this run, the model predicts that 747 passengers died and 431 lived in the training set. (The screenshot from 4a) in the submitted Word document shows 779 and 399, which came from an earlier run with a different random train/test split.)
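A sketch of how the training-set predictions line up against the actual outcomes:

table(Predicted = titanic_bag_pred, Actual = titanic_train$Survived)   # training confusion matrix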

  1. What is the out-of-bag error for your model?
titanic_bag_oob <- bagging(formula = Survived~.,
                          data = titanic_train,
                          coob = T, 
                          nbagg = 300)
titanic_bag_oob
## 
## Bagging classification trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = titanic_train, 
##     coob = T, nbagg = 300)
## 
## Out-of-bag estimate of misclassification error:  0.2029

The out-of-bag misclassification error for the bagging model on this run was 0.2029 (the screenshot in the submitted Word document shows 0.2097 from an earlier run).

  1. Use your model to make predictions for the observations in your testing set.
titanic_bag_pred1 <- predict(titanic_bag, newdata = titanic_test)
summary(titanic_bag_pred1)
##  Died Lived 
##    80    51

On this run, the model predicts that 80 passengers died and 51 lived in the testing set. (The screenshot from 4b) in the submitted Word document shows 94 and 37 from an earlier run.)
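A short sketch for checking the bagging model's accuracy on the testing set:

table(Predicted = titanic_bag_pred1, Actual = titanic_test$Survived)   # test confusion matrix
mean(titanic_bag_pred1 == titanic_test$Survived)                       # test accuracy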

Using 300 trees, develop a random forest model on your training set to predict whether or not a passenger survived the sinking of the Titanic.

library(randomForest)   # for randomForest()

titanic_rf <- randomForest(Survived~., data=titanic_train, importance = TRUE)
titanic_rf
## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.19%
## Confusion matrix:
##       Died Lived class.error
## Died   667    62  0.08504801
## Lived  164   285  0.36525612
mod_rf <- randomForest(Survived~., data=titanic_train, importance = TRUE, ntree = 300)   # refit with 300 trees, as the question requires
mod_rf
## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE,      ntree = 300) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.19%
## Confusion matrix:
##       Died Lived class.error
## Died   668    61  0.08367627
## Lived  165   284  0.36748330
  1. What is the out-of-bag error for your model?
mod_rf
## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE,      ntree = 300) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.19%
## Confusion matrix:
##       Died Lived class.error
## Died   668    61  0.08367627
## Lived  165   284  0.36748330

See the screenshot from 5) in the submitted Word document. On this run, the out-of-bag error estimate for the 300-tree random forest is 0.1919 (19.19%); the 0.202 (20.2%) figure in the screenshot came from an earlier run.

  1. Based on the out-of-bag error, is the random forest better at predicting survival than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy?
titanic_bag_oob
## 
## Bagging classification trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = titanic_train, 
##     coob = T, nbagg = 300)
## 
## Out-of-bag estimate of misclassification error:  0.2029
mod_rf
## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE,      ntree = 300) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.19%
## Confusion matrix:
##       Died Lived class.error
## Died   668    61  0.08367627
## Lived  165   284  0.36748330

See the screenshots from 5b) in the submitted Word document. On this run, the bagging model's OOB error was 0.2029 (20.29%) and the random forest's OOB error was 0.1919 (19.19%), so the random forest performs slightly better than the bagging model (its OOB error is smaller). The earlier run captured in the screenshots (0.2097 vs. 0.2020) shows the same ordering.

  1. What is the false positive rate (for your random forest model developed with 300 trees)?

See the screenshot from 5b) in the submitted Word document for the earlier run. Treating "Lived" as the positive class and using the OOB confusion matrix shown above, the false positives are the 61 passengers who actually died but were predicted to live, so the false positive rate is 61 / (668 + 61) = 0.0837 (8.37%), which matches the class.error reported for the Died row.

  1. What is the false negative rate (for your random forest model developed with 300 trees)?

See the screenshot from 5b) in the submitted Word document for the earlier run. Based on the confusion matrix shown above, the false negatives are the 165 passengers who actually lived but were predicted to die, so the false negative rate is 165 / (165 + 284) = 0.3675 (36.75%), which matches the class.error reported for the Lived row.
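A sketch of computing both rates programmatically from the OOB confusion matrix stored in the fitted model (still treating "Lived" as the positive class):

cm  <- mod_rf$confusion[, c("Died", "Lived")]     # drop the class.error column
fpr <- cm["Died", "Lived"] / sum(cm["Died", ])    # FP / (FP + TN); 61/729 on this run
fnr <- cm["Lived", "Died"] / sum(cm["Lived", ])   # FN / (FN + TP); 165/449 on this run
c(FPR = fpr, FNR = fnr)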

  1. Create a plot comparing the out-of-bag error, the false positive rate, the false negative rate, and the number of trees. Based on your plot, does it appear that a large number of trees are needed to develop a fairly accurate random forest model?
plot(mod_rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))

See the screenshot from 5e) in the submitted Word document. No; the OOB error, false positive rate, and false negative rate all level off quickly, by roughly 25 to 50 trees, so a large number of trees is not needed to develop a fairly accurate random forest model.

  1. Use your random forest model to make predictions for the observations in your testing set.
titanic_rf_pred <- predict(titanic_rf, titanic_test)
summary(titanic_rf_pred)
##  Died Lived 
##    92    39

See the screenshot from 5f) in the submitted Word document. The random forest model predicts that 92 passengers died and 39 lived in the testing set. Note that titanic_rf is the default 500-tree forest; predictions from the 300-tree model are sketched below.
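Since the question asks about the 300-tree model, a short sketch of the equivalent predictions from mod_rf:

titanic_rf_pred300 <- predict(mod_rf, titanic_test)                     # predictions from the 300-tree forest
summary(titanic_rf_pred300)
table(Predicted = titanic_rf_pred300, Actual = titanic_test$Survived)   # test confusion matrix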