Use R Markdown and the TransFood Excel file to answer all of the questions in Part 1 of this homework.
Load the TransFood excel file into R as a dataset of transactions. Each row in the dataset represents items purchased by a particular customer during one visit to a store.
library(readxl)  # read_xlsx()
library(arules)  # transactions class, apriori(), inspect()
tf <- read_xlsx("C:/Users/justt/Desktop/School/621/Assignment/Homework 5/TransFood.xlsx")
What is the largest total number of items purchased by a customer during one visit (i.e., what is the largest number of 1’s in any single row)? (Hint: You can answer this question by looking at a summary of the dataset.)
tf <- as(as.matrix(tf), "transactions")  # coerce the 0/1 purchase matrix to an arules transactions object
summary(tf)
## transactions as itemMatrix in sparse format with
## 19076 rows (elements/itemsets/transactions) and
## 118 columns (items) and a density of 0.02230729
##
## most frequent items:
## Bottled.WaterFood Slice.of.CheeseFood Medium.DrinkFood Small.DrinkFood
## 3166 3072 2871 2769
## Slice.of.PeppFood (Other)
## 2354 35981
##
## element (itemset/transaction) length distribution:
## sizes
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15
## 197 5675 5178 3253 2129 1293 655 351 178 95 42 14 8 7 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.632 4.000 15.000
##
## includes extended item information - examples:
## labels
## 1 Add.CheeseFood
## 2 BeerFood
## 3 Bottled.WaterFood
The largest total number of items purchased by a customer in one visit is 15, and this occurred only once (see the element length distribution above).
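As a quick cross-check, the same answer can be pulled programmatically. This is a small sketch assuming tf is the transactions object created above; size() in arules returns the number of items in each transaction.
max(size(tf))                   # largest basket size: 15
sum(size(tf) == max(size(tf)))  # number of transactions of that size: 1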
The most frequent item purchased, per the summary above, was Bottled.WaterFood, which was purchased 3,166 times.
itemFrequencyPlot(tf, support = 0.1, cex.names = 0.8)  # plot items appearing in at least 10% of transactions
The frequency plot confirms that Bottled.WaterFood was the most frequently purchased item; note the plot shows relative frequency (support), so its bar sits at roughly 3,166/19,076 ≈ 0.166 rather than at a raw count.
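The counts behind the plot can also be read off directly; a sketch using itemFrequency() from arules, whose type = "absolute" option returns raw counts per item.
head(sort(itemFrequency(tf, type = "absolute"), decreasing = TRUE), 5)  # top 5 items by count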
Perform association analysis to determine rules with support of at least 1% and confidence of at least 50%. See the output below.
tfbasket <- apriori(tf, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 190
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.01s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(tfbasket)
## lhs rhs support confidence coverage lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266 0.5596869 0.02678759 13.180972 286
## [2] {ToppingFood} => {Ice.Cream.ConeFood} 0.02856993 0.9981685 0.02862235 8.947868 545
## [3] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649 0.02746907 7.601643 365
## [4] {Chicken.TendersFood} => {French.Fries.BasketFood} 0.01729922 0.7586207 0.02280352 7.771992 330
## [5] {CheeseburgerFood} => {French.Fries.BasketFood} 0.01687985 0.7931034 0.02128329 8.125264 322
## [6] {GatoradeFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01011743 0.5830816 0.01735165 3.620724 193
## [7] {Medium.DrinkFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01362969 0.5273834 0.02584399 3.274858 260
## [8] {Bottled.WaterFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01069407 0.5151515 0.02075907 3.198903 204
tfbasket
## set of 8 rules
There are 8 rules where the support is at least 1% and confidence is at least 50%.
inspect(subset(tfbasket, min(lift)))
## lhs rhs support confidence
## [1] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649
## coverage lift count
## [1] 0.02746907 7.601643 365
Based on this R code output, the returned rule has a lift of 7.601643. However, looking at the full tfbasket output above, the smallest lift is actually 3.198903. Why does subset() return the wrong rule? The most likely explanation is that min(lift) evaluates to the number 3.198903 rather than to a logical condition; subset() then uses that number as a row index, R truncates it to 3, and rule [3] (lift 7.601643) is returned instead of the minimum-lift rule.
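A sketch of two ways to retrieve the minimum-lift rule correctly, assuming the standard arules accessors (quality() returns the rules' measure columns):
inspect(tfbasket[which.min(quality(tfbasket)$lift)])            # index the rule set by position
inspect(subset(tfbasket, lift == min(quality(tfbasket)$lift)))  # or give subset() a logical test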
inspect(subset(tfbasket, lift > 1))
## lhs rhs support confidence coverage lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266 0.5596869 0.02678759 13.180972 286
## [2] {ToppingFood} => {Ice.Cream.ConeFood} 0.02856993 0.9981685 0.02862235 8.947868 545
## [3] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649 0.02746907 7.601643 365
## [4] {Chicken.TendersFood} => {French.Fries.BasketFood} 0.01729922 0.7586207 0.02280352 7.771992 330
## [5] {CheeseburgerFood} => {French.Fries.BasketFood} 0.01687985 0.7931034 0.02128329 8.125264 322
## [6] {GatoradeFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01011743 0.5830816 0.01735165 3.620724 193
## [7] {Medium.DrinkFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01362969 0.5273834 0.02584399 3.274858 260
## [8] {Bottled.WaterFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01069407 0.5151515 0.02075907 3.198903 204
The lift of each of the 8 rules is greater than 1, meaning the items in each rule are purchased together more often than would be expected if they were bought independently.
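As a worked check of what lift measures, using the counts printed by summary(tf) above (Slice.of.CheeseFood appears 3,072 times in 19,076 transactions): lift = confidence / support(rhs).
supp_cheese <- 3072 / 19076   # support of Slice.of.CheeseFood, ~0.1610
0.5151515 / supp_cheese       # ~3.1989, matching the reported lift of rule [8]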
summary(tfbasket)
## set of 8 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 5 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.375 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01012 Min. :0.5152 Min. :0.01735 Min. : 3.199
## 1st Qu.:0.01290 1st Qu.:0.5516 1st Qu.:0.02115 1st Qu.: 3.534
## Median :0.01594 Median :0.6398 Median :0.02432 Median : 7.687
## Mean :0.01641 Mean :0.6790 Mean :0.02387 Mean : 6.965
## 3rd Qu.:0.01776 3rd Qu.:0.7672 3rd Qu.:0.02696 3rd Qu.: 8.331
## Max. :0.02857 Max. :0.9982 Max. :0.02862 Max. :13.181
## count
## Min. :193.0
## 1st Qu.:246.0
## Median :304.0
## Mean :313.1
## 3rd Qu.:338.8
## Max. :545.0
##
## mining info:
## data ntransactions support confidence
## tf 19076 0.01 0.5
## call
## apriori(data = tf, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
By looking at a summary of the tfbasket rules, we can see that the mean confidence is 0.6790.
From the same summary, the median support is 0.01594.
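These figures can also be pulled programmatically from the rules' quality slot, assuming the standard arules quality() accessor:
mean(quality(tfbasket)$confidence)  # ~0.6790
median(quality(tfbasket)$support)   # ~0.01594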
library(arulesViz)  # plot() method for rule sets
plot(tfbasket)
The plot above visualizes the support, confidence, and lift of the 8 rules.
Rules

4. When Chicken.TendersFood is purchased, French.Fries.BasketFood is usually purchased at the same time.
5. When CheeseburgerFood is purchased, French.Fries.BasketFood is usually purchased at the same time.

Assumption: French.Fries.BasketFood is French Fries, CheeseburgerFood is Cheeseburgers, and Chicken.TendersFood is Chicken Tenders.
Based on these two rules, I would recommend that the store locate French Fries near both Chicken Tenders and Cheeseburgers to increase French Fries sales alongside those items. If that isn't possible, a sign could be placed on the freezer doors for Chicken Tenders and Cheeseburgers reminding shoppers to purchase French Fries and indicating which aisle they are in.
Use R Markdown and the Titanic passenger list Excel file to answer all of the questions in Part 2 of this homework. This dataset contains information about all passengers (excluding crew members) on the Titanic.
Load the Titanic passenger list Excel file into R, convert the dataset into a data frame, and convert all non-numeric columns of data into factors.
titanic <- read_xlsx("C:/Users/justt/Desktop/School/621/Assignment/Homework 5/Titanic passenger list.xlsx")
titanic <- as.data.frame(unclass(titanic),stringsAsFactors=TRUE)
str(titanic)
## 'data.frame': 1309 obs. of 7 variables:
## $ Passenger_Class : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Survived : Factor w/ 2 levels "Died","Lived": 2 2 1 1 1 2 2 1 2 1 ...
## $ Sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ Sibling_or_Spouse: num 0 1 1 1 1 0 1 0 2 0 ...
## $ Parent_or_Child : num 0 2 2 2 2 0 0 0 0 0 ...
## $ Fare : num 211 152 152 152 152 ...
## $ Embarked : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 3 3 3 3 3 3 3 3 1 ...
sum(is.na(titanic))
## [1] 3
Yes, there are 3 missing values in the data set.
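To see which columns hold the missing values, a quick sketch (consistent with the imputation steps below, which expect 2 NAs in Embarked and 1 in Fare):
colSums(is.na(titanic))  # NA count per column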
summary(titanic$Embarked)
## Cherbourg Queenstown Southhampton NA's
## 270 123 914 2
There were 270 passengers who embarked from Cherbourg, 123 from Queenstown, and 914 from Southhampton (as the level is spelled in this dataset); the remaining 2 values are missing.
From which location did the largest number of passengers embark?
The largest number of passengers embarked from Southhampton.
Impute all missing values in the Embarked column with this location.
titanic$Embarked[is.na(titanic$Embarked)] <- "Southhampton"  # impute the 2 missing values with the most common port
summary(titanic$Embarked)
## Cherbourg Queenstown Southhampton
## 270 123 916
titanic$Fare[is.na(titanic$Fare)] <- mean(titanic$Fare, na.rm = TRUE)  # impute the single missing fare with the mean fare
str(titanic$Fare)
## num [1:1309] 211 152 152 152 152 ...
summary(titanic$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.896 14.454 33.295 31.275 512.329
sum(is.na(titanic))
## [1] 0
Randomly sample the rows of your data to include 90% of the rows in your training set. Use the rest of the rows as your testing set.
ind <- sample(nrow(titanic), nrow(titanic)*0.9)  # random 90% of row indices; a set.seed() call beforehand would make the split reproducible
titanic_train <- titanic[ind,]
str(titanic_train)
## 'data.frame': 1178 obs. of 7 variables:
## $ Passenger_Class : num 1 1 2 1 3 3 1 3 2 3 ...
## $ Survived : Factor w/ 2 levels "Died","Lived": 2 2 2 2 2 2 2 2 2 1 ...
## $ Sex : Factor w/ 2 levels "female","male": 1 2 2 2 1 1 1 1 2 2 ...
## $ Sibling_or_Spouse: num 1 2 1 0 4 0 0 1 1 0 ...
## $ Parent_or_Child : num 0 2 1 0 2 0 2 1 1 0 ...
## $ Fare : num 55.4 262.4 29 31 31.4 ...
## $ Embarked : Factor w/ 3 levels "Cherbourg","Queenstown",..: 1 1 3 1 3 2 3 1 3 3 ...
titanic_test <- titanic[-ind,]
str(titanic_test)
## 'data.frame': 131 obs. of 7 variables:
## $ Passenger_Class : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Survived : Factor w/ 2 levels "Died","Lived": 2 2 2 1 1 2 2 1 1 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
## $ Sibling_or_Spouse: num 0 0 0 1 0 1 1 0 0 3 ...
## $ Parent_or_Child : num 0 0 1 0 0 1 0 0 0 2 ...
## $ Fare : num 26.6 26.6 55 136.8 25.6 ...
## $ Embarked : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 3 3 1 3 3 1 1 1 3 ...
Using 300 bootstrapped sets, develop a bagging model on your training set to predict whether or not a passenger survived the sinking of the Titanic.
library(ipred)  # bagging()
titanic_bag <- bagging(formula = Survived~., data = titanic_train, nbagg = 300)
titanic_bag
##
## Bagging classification trees with 300 bootstrap replications
##
## Call: bagging.data.frame(formula = Survived ~ ., data = titanic_train,
## nbagg = 300)
titanic_bag_pred <- predict(titanic_bag, newdata = titanic_train)
summary(titanic_bag_pred)
## Died Lived
## 747 431
Use the screenshot from 4a) in the submitted word doc. Per the output above, this predicts that 747 passengers died and 431 lived in the training set (the screenshot shows 779 and 399 from an earlier run; without a fixed random seed the counts vary from run to run).
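A training-set confusion matrix gives a fuller picture than the class counts alone; a quick sketch using the objects defined above:
table(Predicted = titanic_bag_pred, Actual = titanic_train$Survived)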
titanic_bag_oob <- bagging(formula = Survived~.,
data = titanic_train,
coob = T,
nbagg = 300)
titanic_bag_oob
##
## Bagging classification trees with 300 bootstrap replications
##
## Call: bagging.data.frame(formula = Survived ~ ., data = titanic_train,
## coob = T, nbagg = 300)
##
## Out-of-bag estimate of misclassification error: 0.2029
The OOB estimate of misclassification error for my bagging model was 0.2029.
titanic_bag_pred1 <- predict(titanic_bag, newdata = titanic_test)
summary(titanic_bag_pred1)
## Died Lived
## 80 51
Use the screenshot from 4b) in the submitted word doc. Per the output above, this predicts that 80 passengers died and 51 lived in the testing set (the screenshot shows 94 and 37 from an earlier run).
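The test-set predictions can also be scored against the actual outcomes; a sketch assuming the split objects defined above:
table(Predicted = titanic_bag_pred1, Actual = titanic_test$Survived)  # test confusion matrix
mean(titanic_bag_pred1 == titanic_test$Survived)                      # test accuracy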
Using 300 trees, develop a random forest model on your training set to predict whether or not a passenger survived the sinking of the Titanic.
library(randomForest)  # randomForest()
titanic_rf <- randomForest(Survived~., data=titanic_train, importance = TRUE)  # ntree defaults to 500; the 300-tree model is fit below
titanic_rf
##
## Call:
## randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.19%
## Confusion matrix:
## Died Lived class.error
## Died 667 62 0.08504801
## Lived 164 285 0.36525612
mod_rf <- randomForest(Survived~., data=titanic_train, importance = TRUE, ntree = 300)
mod_rf
##
## Call:
## randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.19%
## Confusion matrix:
## Died Lived class.error
## Died 668 61 0.08367627
## Lived 165 284 0.36748330
Use the screenshot from 5) in the submitted word doc. Per the output above, the out-of-bag estimate of the error rate is 19.19% (the screenshot shows 20.2% from an earlier run).
titanic_bag_oob
##
## Bagging classification trees with 300 bootstrap replications
##
## Call: bagging.data.frame(formula = Survived ~ ., data = titanic_train,
## coob = T, nbagg = 300)
##
## Out-of-bag estimate of misclassification error: 0.2029
mod_rf
##
## Call:
## randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.19%
## Confusion matrix:
## Died Lived class.error
## Died 668 61 0.08367627
## Lived 165 284 0.36748330
Use the screenshot from 5b) in the submitted word doc. Per the outputs above, the bagging model's OOB error was 0.2029 (20.29%) and the random forest's OOB error was 0.1919 (19.19%). The random forest model's error was better (smaller) than the bagging model's.
Use the screenshot from 5b) in the submitted word doc. Treating Lived as the positive class, the false positive rate is the share of actual Died passengers predicted Lived: 61/(668 + 61) = 0.0837 (8.37%), the Died class.error in the confusion matrix above.
Use the screenshot from 5b) in the submitted word doc. The false negative rate is the share of actual Lived passengers predicted Died: 165/(165 + 284) = 0.3675 (36.75%), the Lived class.error.
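These rates can be computed directly from the model's stored confusion matrix; a sketch assuming mod_rf$confusion holds actual classes in rows (as randomForest objects do):
cm <- mod_rf$confusion[, c("Died", "Lived")]  # drop the class.error column
cm["Died", "Lived"] / sum(cm["Died", ])       # FPR, ~0.0837
cm["Lived", "Died"] / sum(cm["Lived", ])      # FNR, ~0.3675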
plot(mod_rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))
Use the screenshot from 5e) in the submitted word doc. No; the error curves flatten well before 300 trees. Fewer than 50 trees, closer to 25, appear to be enough.
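This can be checked numerically: mod_rf$err.rate is a matrix with one row per tree grown and columns OOB, Died, and Lived, so comparing the OOB column at a few tree counts shows where the error stabilizes.
mod_rf$err.rate[c(25, 50, 300), "OOB"]  # OOB error after 25, 50, and 300 trees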
titanic_rf_pred <- predict(titanic_rf, titanic_test)  # note: titanic_rf is the default 500-tree fit; predict(mod_rf, titanic_test) would use the 300-tree model
summary(titanic_rf_pred)
## Died Lived
## 92 39
Use the screenshot from 5f) in the submitted word doc. This predicts that 92 passengers died and 39 lived in the testing set using my random forest model.
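Scoring these predictions against the actual test outcomes, as was done above for the bagging model:
table(Predicted = titanic_rf_pred, Actual = titanic_test$Survived)  # test confusion matrix
mean(titanic_rf_pred == titanic_test$Survived)                      # test accuracy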