Reading in both datasets for simplicity purposes:
TransFood <- read_excel("C:/Users/elyse/OneDrive/Desktop/TransFood.xlsx")
Titanic <- read_excel("C:/Users/elyse/OneDrive/Desktop/Titanic passenger list.xlsx")
TransFood <- as(as.matrix(TransFood), "transactions")Part 1: Association Analysis
1. Summary - Items Purchased
Determine the largest total number of items purchased by a customer in one visit
## transactions as itemMatrix in sparse format with
## 19076 rows (elements/itemsets/transactions) and
## 118 columns (items) and a density of 0.02230729
##
## most frequent items:
## Bottled.WaterFood Slice.of.CheeseFood Medium.DrinkFood Small.DrinkFood
## 3166 3072 2871 2769
## Slice.of.PeppFood (Other)
## 2354 35981
##
## element (itemset/transaction) length distribution:
## sizes
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15
## 197 5675 5178 3253 2129 1293 655 351 178 95 42 14 8 7 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.632 4.000 15.000
##
## includes extended item information - examples:
## labels
## 1 Add.CheeseFood
## 2 BeerFood
## 3 Bottled.WaterFood
The greatest amount of items purchased by a single customer was 15.
2. Frequently Purchased
Create a frequency plot of the most purchased items.
The item most frequently purchased by customers is bottled water!
Wasting plastic!!!
3. Association Analysis
a. Rules
Determine number of association rules that can be found that have support of a least 1% and confidence of at least 50%
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 190
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.00s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 8 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 5 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.375 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01012 Min. :0.5152 Min. :0.01735 Min. : 3.199
## 1st Qu.:0.01290 1st Qu.:0.5516 1st Qu.:0.02115 1st Qu.: 3.534
## Median :0.01594 Median :0.6398 Median :0.02432 Median : 7.687
## Mean :0.01641 Mean :0.6790 Mean :0.02387 Mean : 6.965
## 3rd Qu.:0.01776 3rd Qu.:0.7672 3rd Qu.:0.02696 3rd Qu.: 8.331
## Max. :0.02857 Max. :0.9982 Max. :0.02862 Max. :13.181
## count
## Min. :193.0
## 1st Qu.:246.0
## Median :304.0
## Mean :313.1
## 3rd Qu.:338.8
## Max. :545.0
##
## mining info:
## data ntransactions support confidence
## TransFood 19076 0.01 0.5
## call
## apriori(data = TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
There are 8 rules generated.
b. Lift
Determine the smallest value of lift obtained for the association rules calculated.
## lhs rhs support confidence coverage lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266 0.5596869 0.02678759 13.180972 286
## [2] {ToppingFood} => {Ice.Cream.ConeFood} 0.02856993 0.9981685 0.02862235 8.947868 545
## [3] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649 0.02746907 7.601643 365
## [4] {Chicken.TendersFood} => {French.Fries.BasketFood} 0.01729922 0.7586207 0.02280352 7.771992 330
## [5] {CheeseburgerFood} => {French.Fries.BasketFood} 0.01687985 0.7931034 0.02128329 8.125264 322
## [6] {GatoradeFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01011743 0.5830816 0.01735165 3.620724 193
## [7] {Medium.DrinkFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01362969 0.5273834 0.02584399 3.274858 260
## [8] {Bottled.WaterFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01069407 0.5151515 0.02075907 3.198903 204
The smallest value of lift is 3.199
c. Items Purchased Together
Lift values that are greater than one would indicate that these items are puchased together. All values of lift contained within the rules generated are greater than that of one, meaning that they are frequently purchased together.
d. Average Confidence
## lhs rhs support confidence coverage lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266 0.5596869 0.02678759 13.180972 286
## [2] {ToppingFood} => {Ice.Cream.ConeFood} 0.02856993 0.9981685 0.02862235 8.947868 545
## [3] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649 0.02746907 7.601643 365
## [4] {Chicken.TendersFood} => {French.Fries.BasketFood} 0.01729922 0.7586207 0.02280352 7.771992 330
## [5] {CheeseburgerFood} => {French.Fries.BasketFood} 0.01687985 0.7931034 0.02128329 8.125264 322
## [6] {GatoradeFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01011743 0.5830816 0.01735165 3.620724 193
## [7] {Medium.DrinkFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01362969 0.5273834 0.02584399 3.274858 260
## [8] {Bottled.WaterFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01069407 0.5151515 0.02075907 3.198903 204
## Warning in mean.default(sum1, na.rm = TRUE): argument is not numeric or
## logical: returning NA
## [1] NA
The mean of confidence is equal to .679
e. Median Support
determine the median support obtained for the association rules calculated.
## lhs rhs support confidence coverage lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266 0.5596869 0.02678759 13.180972 286
## [2] {ToppingFood} => {Ice.Cream.ConeFood} 0.02856993 0.9981685 0.02862235 8.947868 545
## [3] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649 0.02746907 7.601643 365
## [4] {Chicken.TendersFood} => {French.Fries.BasketFood} 0.01729922 0.7586207 0.02280352 7.771992 330
## [5] {CheeseburgerFood} => {French.Fries.BasketFood} 0.01687985 0.7931034 0.02128329 8.125264 322
## [6] {GatoradeFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01011743 0.5830816 0.01735165 3.620724 193
## [7] {Medium.DrinkFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01362969 0.5273834 0.02584399 3.274858 260
## [8] {Bottled.WaterFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01069407 0.5151515 0.02075907 3.198903 204
## NULL
The median support is .0159
f. Create Visualizations
Create a plot to visualize the support, confidence, and lift of the rules calculated.
g. Store Benefit
{ToppingFood} => {Ice.Cream.ConeFood}
These items are frequently purchased together as they go hand on. Additionally, these items are likely have a high level of seasonality. During a downturn of sales, these items could be place at a slightly reduced cost, also known as a sale, at the same time and having additional toppings options placed in the ice cream section.
{Chicken.TendersFood} => {French.Fries.BasketFood}
Chicken tenders and fries are an iconic and staple! Now that it is known that there are frequently purchased at the same time, a combined item should be created or discount on the addition of a side item should be incorporated to increase sales of both items. Or if they purchase fries, a free drink could be offered.
Part 2: Bagging and Random Forests
1. Convert non-numeric columns to factors
2. Check for Missing Values
## [1] 3
## [1] 169 285 1226
## # A tibble: 3 × 7
## Passenger_Class Survived Sex Sibling_or_Spouse Parent_or_Ch…¹ Fare Embar…²
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <fct>
## 1 1 Lived female 0 0 80 <NA>
## 2 1 Lived female 0 0 80 <NA>
## 3 3 Died male 0 0 NA Southh…
## # … with abbreviated variable names ¹Parent_or_Child, ²Embarked
There are 3 missing values, 2 embarked, one with fare
Impute
## Count
## Cherbourg 270
## Queenstown 123
## Southhampton 914
## NA's 2
ggplot(Titanic, aes(x = Embarked, fill = Embarked)) +
geom_bar() +
labs(title = "Passengers Embarked from Each City",
x = "Embarked City",
y = "Count")## [1] 33.29548
Titanic$Embarked[is.na(Titanic$Embarked)] <- "Southhampton"
Titanic$Fare[is.na(Titanic$Fare)] <- mean(Titanic$Fare, na.rm = TRUE)
sum(is.na(Titanic))## [1] 0
Southhampton has the greatest count and will be the impute value for NAs for the Embarked city
Fare will be imputed with the mean 33.29548.
After these changes, there is now 0 items that have NA values.
3. Randomly Sample Rows
Sample the rows of your data to include 90% of the rows in your training set
4. Bagging Model
Develop a bagging model on your training set to predict whether or not a passenger survived the sinking of the Titanic.
##
## Bagging classification trees with 300 bootstrap replications
##
## Call: bagging.data.frame(formula = Survived ~ ., data = train, coob = T,
## nbagg = 300)
##
## Out-of-bag estimate of misclassification error: 0.2097
####. a OOB Error
## [1] Lived Died Lived Lived Lived Died Died Lived Lived Lived Lived Lived
## [13] Died Died Lived Lived Died Lived Died Lived Lived Died Lived Lived
## [25] Died Died Lived Died Died Died Lived Lived Died Lived Lived Died
## [37] Died Died Lived Died Lived Lived Died Lived Died Died Lived Died
## [49] Died Died Died Died Lived Lived Died Lived Lived Died Died Died
## [61] Died Lived Lived Died Died Died Lived Died Lived Died Lived Died
## [73] Lived Died Died Died Lived Died Lived Died Lived Died Died Died
## [85] Died Died Died Lived Died Died Died Died Died Died Died Lived
## [97] Died Died Lived Lived Lived Lived Died Lived Died Died Died Died
## [109] Died Died Died Died Died Died Died Died Died Died Died Died
## [121] Died Lived Died Died Died Died Died Died Died Died Died
## Levels: Died Lived
## [1] 1 0 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 0 1 1 0 0
## [38] 0 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 1 0
## [75] 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [1] 1 0 1 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 1 0 0
## [38] 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0
## [75] 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0
## [112] 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
Converted data types to be in 0 and 1 to get a calculated value.
## [1] 0.1755725
## [1] 23
The OOB error is .2199, this is generally considered to be a “good”
or acceptable value for OOB. The lower this value is, the better
performance of a model.
The mean squared error is .0992 meanng that 9.92% of the time the model
is making incorrect predictions. Which makes sense, if we look at the
predicted values for the data in comparison to that of the actual data
from the test data set, they are very comparable. There are 13 values
that are not predicted correctly. Hence, based upon a low OOB and a low
MSE I would conclude that this model is performing quite well!
5.Random Forest
a. OOB
Determine the OOB error for this model
##
## Call:
## randomForest(formula = Survived ~ ., data = train, importance = TRUE, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.8%
## Confusion matrix:
## Died Lived class.error
## Died 644 78 0.1080332
## Lived 167 289 0.3662281
The OOB error estimate is .1944 or 19.44%
Considering that the goal is to have a lower OOB error, the random forest would be performing slightly better than that of the bagging model. Yet, there is a trade off when choosing a model of greater complexity, hence both models would be useful.
c. and d.
The false positive rate is equal to .0825 or 7.771% These are the cases where the model predicted that Lived but the person died. The false negatve rate is equal to .3895 or 38.27%. These are the cases where the model predicted Died but the person lived.
e. Plot OOB
Create a plot comparing the out-of-bag error, the false positive rate, the false negative rate, and the number of trees
plot(mod_rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))
This model likely could have been simplified. It appears tha after a
much lower amount of trees the OOB stabilizes. Which would be indicative
that we could use less.
test theory:
##
## Call:
## randomForest(formula = Survived ~ ., data = train, importance = TRUE, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.37%
## Confusion matrix:
## Died Lived class.error
## Died 645 77 0.1066482
## Lived 163 293 0.3574561
Reduced trees and only see the slightest increase in OOB. Additionally the classification rate errors have no increased by a substantial amount either.