Reading in both datasets for simplicity purposes:

TransFood <- read_excel("C:/Users/elyse/OneDrive/Desktop/TransFood.xlsx")
Titanic <- read_excel("C:/Users/elyse/OneDrive/Desktop/Titanic passenger list.xlsx")
TransFood <- as(as.matrix(TransFood), "transactions")

Titanic <- read_excel("C:/Users/elyse/OneDrive/Desktop/Titanic passenger list.xlsx")

Part 1: Association Analysis

1. Summary - Items Purchased

Determine the largest total number of items purchased by a customer in one visit

summary(TransFood)

## transactions as itemMatrix in sparse format with
##  19076 rows (elements/itemsets/transactions) and
##  118 columns (items) and a density of 0.02230729 
## 
## most frequent items:
##   Bottled.WaterFood Slice.of.CheeseFood    Medium.DrinkFood     Small.DrinkFood 
##                3166                3072                2871                2769 
##   Slice.of.PeppFood             (Other) 
##                2354               35981 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
##  197 5675 5178 3253 2129 1293  655  351  178   95   42   14    8    7    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.632   4.000  15.000 
## 
## includes extended item information - examples:
##              labels
## 1    Add.CheeseFood
## 2          BeerFood
## 3 Bottled.WaterFood

The greatest amount of items purchased by a single customer was 15.

2. Frequently Purchased

Create a frequency plot of the most purchased items.

itemFrequencyPlot(TransFood,  topN = 10,  type = "absolute", cex.names = 1)

The item most frequently purchased by customers is bottled water! Wasting plastic!!!

3. Association Analysis

a. Rules

Determine number of association rules that can be found that have support of a least 1% and confidence of at least 50%

basket_rules <- apriori(TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 190 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.00s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(basket_rules)

## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.375   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01012   Min.   :0.5152   Min.   :0.01735   Min.   : 3.199  
##  1st Qu.:0.01290   1st Qu.:0.5516   1st Qu.:0.02115   1st Qu.: 3.534  
##  Median :0.01594   Median :0.6398   Median :0.02432   Median : 7.687  
##  Mean   :0.01641   Mean   :0.6790   Mean   :0.02387   Mean   : 6.965  
##  3rd Qu.:0.01776   3rd Qu.:0.7672   3rd Qu.:0.02696   3rd Qu.: 8.331  
##  Max.   :0.02857   Max.   :0.9982   Max.   :0.02862   Max.   :13.181  
##      count      
##  Min.   :193.0  
##  1st Qu.:246.0  
##  Median :304.0  
##  Mean   :313.1  
##  3rd Qu.:338.8  
##  Max.   :545.0  
## 
## mining info:
##       data ntransactions support confidence
##  TransFood         19076    0.01        0.5
##                                                                                   call
##  apriori(data = TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))

There are 8 rules generated.

b. Lift

Determine the smallest value of lift obtained for the association rules calculated.

inspect(basket_rules)

##     lhs                                    rhs                             support confidence   coverage      lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266  0.5596869 0.02678759 13.180972   286
## [2] {ToppingFood}                       => {Ice.Cream.ConeFood}         0.02856993  0.9981685 0.02862235  8.947868   545
## [3] {Add.CheeseFood}                    => {Soft.Pretzel..3_39Food}     0.01913399  0.6965649 0.02746907  7.601643   365
## [4] {Chicken.TendersFood}               => {French.Fries.BasketFood}    0.01729922  0.7586207 0.02280352  7.771992   330
## [5] {CheeseburgerFood}                  => {French.Fries.BasketFood}    0.01687985  0.7931034 0.02128329  8.125264   322
## [6] {GatoradeFood,                                                                                                      
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01011743  0.5830816 0.01735165  3.620724   193
## [7] {Medium.DrinkFood,                                                                                                  
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01362969  0.5273834 0.02584399  3.274858   260
## [8] {Bottled.WaterFood,                                                                                                 
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01069407  0.5151515 0.02075907  3.198903   204

The smallest value of lift is 3.199

c. Items Purchased Together

Lift values that are greater than one would indicate that these items are puchased together. All values of lift contained within the rules generated are greater than that of one, meaning that they are frequently purchased together.

d. Average Confidence

sum1 <- inspect(basket_rules)$confidence

##     lhs                                    rhs                             support confidence   coverage      lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266  0.5596869 0.02678759 13.180972   286
## [2] {ToppingFood}                       => {Ice.Cream.ConeFood}         0.02856993  0.9981685 0.02862235  8.947868   545
## [3] {Add.CheeseFood}                    => {Soft.Pretzel..3_39Food}     0.01913399  0.6965649 0.02746907  7.601643   365
## [4] {Chicken.TendersFood}               => {French.Fries.BasketFood}    0.01729922  0.7586207 0.02280352  7.771992   330
## [5] {CheeseburgerFood}                  => {French.Fries.BasketFood}    0.01687985  0.7931034 0.02128329  8.125264   322
## [6] {GatoradeFood,                                                                                                      
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01011743  0.5830816 0.01735165  3.620724   193
## [7] {Medium.DrinkFood,                                                                                                  
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01362969  0.5273834 0.02584399  3.274858   260
## [8] {Bottled.WaterFood,                                                                                                 
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01069407  0.5151515 0.02075907  3.198903   204

mean(sum1, na.rm = TRUE)

## Warning in mean.default(sum1, na.rm = TRUE): argument is not numeric or
## logical: returning NA

## [1] NA

The mean of confidence is equal to .679

e. Median Support

determine the median support obtained for the association rules calculated.

sum2 <- inspect(basket_rules)$support

##     lhs                                    rhs                             support confidence   coverage      lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266  0.5596869 0.02678759 13.180972   286
## [2] {ToppingFood}                       => {Ice.Cream.ConeFood}         0.02856993  0.9981685 0.02862235  8.947868   545
## [3] {Add.CheeseFood}                    => {Soft.Pretzel..3_39Food}     0.01913399  0.6965649 0.02746907  7.601643   365
## [4] {Chicken.TendersFood}               => {French.Fries.BasketFood}    0.01729922  0.7586207 0.02280352  7.771992   330
## [5] {CheeseburgerFood}                  => {French.Fries.BasketFood}    0.01687985  0.7931034 0.02128329  8.125264   322
## [6] {GatoradeFood,                                                                                                      
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01011743  0.5830816 0.01735165  3.620724   193
## [7] {Medium.DrinkFood,                                                                                                  
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01362969  0.5273834 0.02584399  3.274858   260
## [8] {Bottled.WaterFood,                                                                                                 
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01069407  0.5151515 0.02075907  3.198903   204

median(sum2, na.rm = TRUE)

## NULL

The median support is .0159

f. Create Visualizations

Create a plot to visualize the support, confidence, and lift of the rules calculated.

plot(basket_rules)

g. Store Benefit

{ToppingFood} => {Ice.Cream.ConeFood}

These items are frequently purchased together as they go hand on. Additionally, these items are likely have a high level of seasonality. During a downturn of sales, these items could be place at a slightly reduced cost, also known as a sale, at the same time and having additional toppings options placed in the ice cream section.

{Chicken.TendersFood} => {French.Fries.BasketFood}

Chicken tenders and fries are an iconic and staple! Now that it is known that there are frequently purchased at the same time, a combined item should be created or discount on the addition of a side item should be incorporated to increase sales of both items. Or if they purchase fries, a free drink could be offered.

Part 2: Bagging and Random Forests

1. Convert non-numeric columns to factors

Titanic$Survived <- as.factor(Titanic$Survived)
Titanic$Sex <- as.factor(Titanic$Sex)
Titanic$Embarked <- as.factor(Titanic$Embarked)

2. Check for Missing Values

sum(is.na(Titanic))

## [1] 3

missing <- which(apply(Titanic, 1, function(row) any(is.na(row))))
missing

## [1]  169  285 1226

Titanic[c(169,285,1226),]

## # A tibble: 3 × 7
##   Passenger_Class Survived Sex    Sibling_or_Spouse Parent_or_Ch…¹  Fare Embar…²
##             <dbl> <fct>    <fct>              <dbl>          <dbl> <dbl> <fct>  
## 1               1 Lived    female                 0              0    80 <NA>   
## 2               1 Lived    female                 0              0    80 <NA>   
## 3               3 Died     male                   0              0    NA Southh…
## # … with abbreviated variable names ¹Parent_or_Child, ²Embarked

There are 3 missing values, 2 embarked, one with fare

Impute

Count <- summary(Titanic$Embarked)
Count <- as.data.frame(Count)
Count

##              Count
## Cherbourg      270
## Queenstown     123
## Southhampton   914
## NA's             2

ggplot(Titanic, aes(x = Embarked, fill = Embarked)) +
  geom_bar() +
  labs(title = "Passengers Embarked from Each City",
       x = "Embarked City",
       y = "Count")

mean(Titanic$Fare, na.rm = TRUE)

## [1] 33.29548

Titanic$Embarked[is.na(Titanic$Embarked)] <- "Southhampton"
Titanic$Fare[is.na(Titanic$Fare)] <- mean(Titanic$Fare, na.rm = TRUE)

sum(is.na(Titanic))

## [1] 0

Southhampton has the greatest count and will be the impute value for NAs for the Embarked city

Fare will be imputed with the mean 33.29548.

After these changes, there is now 0 items that have NA values.

3. Randomly Sample Rows

Sample the rows of your data to include 90% of the rows in your training set

index <- sample(nrow(Titanic), nrow(Titanic)*0.90)
train = Titanic[index,]
test = Titanic[-index,]

4. Bagging Model

Develop a bagging model on your training set to predict whether or not a passenger survived the sinking of the Titanic.

titanic_bag <- bagging(formula = Survived~., data = train, coob = T, nbagg = 300) 
titanic_bag

## 
## Bagging classification trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = train, coob = T, 
##     nbagg = 300)
## 
## Out-of-bag estimate of misclassification error:  0.2097

####. a OOB Error

titanic_bag_pred <- predict(titanic_bag, newdata = test)
titanic_bag_pred

##   [1] Lived Died  Lived Lived Lived Died  Died  Lived Lived Lived Lived Lived
##  [13] Died  Died  Lived Lived Died  Lived Died  Lived Lived Died  Lived Lived
##  [25] Died  Died  Lived Died  Died  Died  Lived Lived Died  Lived Lived Died 
##  [37] Died  Died  Lived Died  Lived Lived Died  Lived Died  Died  Lived Died 
##  [49] Died  Died  Died  Died  Lived Lived Died  Lived Lived Died  Died  Died 
##  [61] Died  Lived Lived Died  Died  Died  Lived Died  Lived Died  Lived Died 
##  [73] Lived Died  Died  Died  Lived Died  Lived Died  Lived Died  Died  Died 
##  [85] Died  Died  Died  Lived Died  Died  Died  Died  Died  Died  Died  Lived
##  [97] Died  Died  Lived Lived Lived Lived Died  Lived Died  Died  Died  Died 
## [109] Died  Died  Died  Died  Died  Died  Died  Died  Died  Died  Died  Died 
## [121] Died  Lived Died  Died  Died  Died  Died  Died  Died  Died  Died 
## Levels: Died Lived

titanic_bag_pred_binary <- ifelse(titanic_bag_pred == 'Lived', 1, 0)
titanic_bag_pred_binary

##   [1] 1 0 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 0 1 1 0 0
##  [38] 0 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 1 0
##  [75] 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

test$Survived <- ifelse(test$Survived == 'Lived', 1, 0)
test$Survived

##   [1] 1 0 1 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 1 0 0
##  [38] 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0
##  [75] 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0
## [112] 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0

Converted data types to be in 0 and 1 to get a calculated value.

mean((test$Survived - titanic_bag_pred_binary)^2)

## [1] 0.1755725

sum(test$Survived != titanic_bag_pred_binary)

## [1] 23

The OOB error is .2199, this is generally considered to be a “good” or acceptable value for OOB. The lower this value is, the better performance of a model.
The mean squared error is .0992 meanng that 9.92% of the time the model is making incorrect predictions. Which makes sense, if we look at the predicted values for the data in comparison to that of the actual data from the test data set, they are very comparable. There are 13 values that are not predicted correctly. Hence, based upon a low OOB and a low MSE I would conclude that this model is performing quite well!

5.Random Forest

mod_rf <- randomForest(Survived~., data=train, importance = TRUE, ntree = 300)

a. OOB

Determine the OOB error for this model

mod_rf

## 
## Call:
##  randomForest(formula = Survived ~ ., data = train, importance = TRUE,      ntree = 300) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.8%
## Confusion matrix:
##       Died Lived class.error
## Died   644    78   0.1080332
## Lived  167   289   0.3662281

The OOB error estimate is .1944 or 19.44%

Considering that the goal is to have a lower OOB error, the random forest would be performing slightly better than that of the bagging model. Yet, there is a trade off when choosing a model of greater complexity, hence both models would be useful.

c. and d.

The false positive rate is equal to .0825 or 7.771% These are the cases where the model predicted that Lived but the person died. The false negatve rate is equal to .3895 or 38.27%. These are the cases where the model predicted Died but the person lived.

e. Plot OOB

Create a plot comparing the out-of-bag error, the false positive rate, the false negative rate, and the number of trees

plot(mod_rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))

This model likely could have been simplified. It appears tha after a much lower amount of trees the OOB stabilizes. Which would be indicative that we could use less.

test theory:

mod2_rf <- randomForest(Survived~., data=train, importance = TRUE, ntree = 100)
mod2_rf

## 
## Call:
##  randomForest(formula = Survived ~ ., data = train, importance = TRUE,      ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.37%
## Confusion matrix:
##       Died Lived class.error
## Died   645    77   0.1066482
## Lived  163   293   0.3574561

Reduced trees and only see the slightest increase in OOB. Additionally the classification rate errors have no increased by a substantial amount either.

Homework #4

Elyse Stasil

2023-11-15