Some questions on this homework involve random sampling, so each student may obtain slightly different results. Because of this, I need to see both your code and the resulting output. For simplicity, and to ensure that your output matches your textual explanations and answers, I will accept copies of your output placed in Microsoft Word for this assignment. Alternatively, if you feel more comfortable doing so, you can submit an HTML document produced in R Markdown. Just be sure that the HTML document displays your output and that your written answers match the displayed output.
Use R Markdown and the TransFood Excel file to answer all of the questions in Part 1 of this homework.
# Import the readxl package for reading Excel files
library(readxl)
#install.packages("arules")
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
TransFood <- read_excel("/Users/kamriefoster/Downloads/TransFood.xlsx")
# Coerce the 0/1 purchase matrix into an arules "transactions" object
TransFood <- as(as.matrix(TransFood), "transactions")
Each row in the dataset represents items purchased by a particular customer during one visit to a store. What is the largest total number of items purchased by a customer during one visit (i.e., what is the largest number of 1’s in any single row)? (Hint: You can answer this question by looking at a summary of the dataset.)
summary(TransFood)
## transactions as itemMatrix in sparse format with
## 19076 rows (elements/itemsets/transactions) and
## 118 columns (items) and a density of 0.02230729
##
## most frequent items:
## Bottled.WaterFood Slice.of.CheeseFood Medium.DrinkFood Small.DrinkFood
## 3166 3072 2871 2769
## Slice.of.PeppFood (Other)
## 2354 35981
##
## element (itemset/transaction) length distribution:
## sizes
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15
## 197 5675 5178 3253 2129 1293 655 351 178 95 42 14 8 7 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.632 4.000 15.000
##
## includes extended item information - examples:
## labels
## 1 Add.CheeseFood
## 2 BeerFood
## 3 Bottled.WaterFood
The largest total number of items purchased by a customer during one visit was 15.
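As a cross-check, the same answer can be computed directly: the arules function size() returns the number of items in each transaction, so the largest basket is simply the maximum of those counts.
# Cross-check: size() counts the items in each transaction
max(size(TransFood))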
# Plot the frequency of every item appearing in at least 10% of transactions
itemFrequencyPlot(TransFood, support = 0.1, cex.names = 0.8)
According to the item frequency plot, the item most frequently bought by customers is Bottled.WaterFood, which agrees with the most frequent items listed in the summary above.
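The plot can also be confirmed numerically: itemFrequency() returns the support of every item, so sorting it gives the top sellers (a minimal sketch).
# Numeric confirmation of the plot: the five items with the highest support
head(sort(itemFrequency(TransFood), decreasing = TRUE), 5)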
## Run the apriori algorithm
#creating the association rules
basket_rules <- apriori(TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 190
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.00s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Find all rules with a minimum support of 1% and a minimum confidence of 50%.
# By default maxlen = 10, so a rule can contain no more than 10 items.
summary(basket_rules)
## set of 8 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 5 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.375 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01012 Min. :0.5152 Min. :0.01735 Min. : 3.199
## 1st Qu.:0.01290 1st Qu.:0.5516 1st Qu.:0.02115 1st Qu.: 3.534
## Median :0.01594 Median :0.6398 Median :0.02432 Median : 7.687
## Mean :0.01641 Mean :0.6790 Mean :0.02387 Mean : 6.965
## 3rd Qu.:0.01776 3rd Qu.:0.7672 3rd Qu.:0.02696 3rd Qu.: 8.331
## Max. :0.02857 Max. :0.9982 Max. :0.02862 Max. :13.181
## count
## Min. :193.0
## 1st Qu.:246.0
## Median :304.0
## Mean :313.1
## 3rd Qu.:338.8
## Max. :545.0
##
## mining info:
## data ntransactions support confidence
## TransFood 19076 0.01 0.5
## call
## apriori(data = TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
There are 8 association rules that have a support of at least 1% and confidence of at least 50%.
The minimum lift value obtained for the association rules calculated is 3.199.
Since all lift values are greater than 1, the rules indicate that the items in each rule tend to be bought together: if you purchase the item(s) on the left-hand side of a rule, you are more likely to also purchase the item(s) on the right-hand side.
What is the average confidence obtained for the association rules calculated in this question?
The average confidence is 0.6790 for the association rules calculated in this question (the mean confidence reported in the summary above).
The median support obtained for the association rules calculated in this question is 0.01594.
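Rather than reading these values off summary(), they can be pulled straight from the rule set: quality() returns a data frame with one row per rule (a minimal sketch).
# Extract the quality measures and summarize them directly
rule_quality <- quality(basket_rules)
mean(rule_quality$confidence)   # average confidence
median(rule_quality$support)    # median support
min(rule_quality$lift)          # minimum lift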
#install.packages('arulesViz')
library('arulesViz')
plot(basket_rules)
# Check the generated rules using inspect
inspect(basket_rules)
## lhs rhs support confidence coverage lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266 0.5596869 0.02678759 13.180972 286
## [2] {ToppingFood} => {Ice.Cream.ConeFood} 0.02856993 0.9981685 0.02862235 8.947868 545
## [3] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649 0.02746907 7.601643 365
## [4] {Chicken.TendersFood} => {French.Fries.BasketFood} 0.01729922 0.7586207 0.02280352 7.771992 330
## [5] {CheeseburgerFood} => {French.Fries.BasketFood} 0.01687985 0.7931034 0.02128329 8.125264 322
## [6] {GatoradeFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01011743 0.5830816 0.01735165 3.620724 193
## [7] {Medium.DrinkFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01362969 0.5273834 0.02584399 3.274858 260
## [8] {Bottled.WaterFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01069407 0.5151515 0.02075907 3.198903 204
Rule 5: CheeseburgerFood => French.Fries.BasketFood
This rule indicates that people who buy a cheeseburger are more likely to also buy a basket of fries.
Rule 4: Chicken.TendersFood => French.Fries.BasketFood
This rule indicates that people who buy chicken tenders are more likely to also buy a basket of fries.
Knowing this, a store could sell these items as a combo deal rather than selling them separately, much as McDonald's and other fast-food restaurants offer a meal option where the two items are included for one price. Since these combinations occur so often, a combo adds convenience for customers: with a meal option, people will likely feel more inclined to buy the burger or tenders with fries rather than without. The business could also price the meal slightly higher than the two items purchased separately, and few customers would object, since most will not take the time to do that calculation. The business would then not only sell more of each product but potentially sell them at a higher price, increasing both sales and profit.
Use R Markdown and the Titanic passenger list excel file to answer all of the questions in Part 2 of this homework. This dataset contains information about all passengers (excluding crew members) on the Titanic.
Titanic <- read_excel("/Users/kamriefoster/Downloads/Titanic passenger list.xlsx")
Titanic <- as.data.frame(Titanic)
# Convert the categorical columns to factors
Titanic$Survived <- as.factor(Titanic$Survived)
Titanic$Sex <- as.factor(Titanic$Sex)
Titanic$Embarked <- as.factor(Titanic$Embarked)
sum(is.na(Titanic))
## [1] 3
There are 3 missing values in the dataset that should be handled.
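A per-column count of missing values shows exactly where the imputation is needed (a minimal base R sketch).
# Count the missing values in each column
colSums(is.na(Titanic))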
table(Titanic$Embarked)
##
## Cherbourg Queenstown Southhampton
## 270 123 914
# The statistical mode can also be computed; note that base R's mode() returns
# the storage type, not the statistical mode, so we define a small helper.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
result <- getmode(Titanic$Embarked)
print(result)
## [1] Southhampton
## Levels: Cherbourg Queenstown Southhampton
According to the table above, the largest number of passengers (914) embarked from Southhampton.
Impute all missing values in the Embarked column with this location.
# Impute missing values by the most common location
Titanic$Embarked[is.na(Titanic$Embarked)] <- getmode(Titanic$Embarked)
sum(is.na(Titanic$Embarked))
## [1] 0
# Impute missing values by median
Titanic$Parent_or_Child[is.na(Titanic$Parent_or_Child)] <- median(Titanic$Parent_or_Child, na.rm = T)
Titanic$Sibling_or_Spouse[is.na(Titanic$Sibling_or_Spouse)] <- median(Titanic$Sibling_or_Spouse, na.rm = T)
Titanic$Fare[is.na(Titanic$Fare)] <- median(Titanic$Fare, na.rm = T)
sum(is.na(Titanic))
## [1] 0
# Randomly assign 90% of the rows to training and the remaining 10% to test
index <- sample(nrow(Titanic), nrow(Titanic)*0.90)
Titanic_train = Titanic[index,]
Titanic_test = Titanic[-index,]
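Because sample() is random, this split (and every result downstream of it) changes from run to run. If a reproducible split were wanted, a seed could be set before sampling; the sketch below uses an arbitrary seed value.
# Optional: fix the random seed so the same 90/10 split is drawn every run
# (the seed value 123 is arbitrary)
set.seed(123)
index <- sample(nrow(Titanic), nrow(Titanic)*0.90)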
### Create the bagging model by using the following library ###
#install.packages('ipred')
library(ipred)
Developing the model
Titanic_bag <- bagging(formula = Survived~., data = Titanic_train, nbagg = 300)
#300 bootstrap samples, and we fit a classification tree to each of these 300 bootstrap samples. The final prediction is a majority vote across the 300 trees.
#nbagg is the number of bootstrap replications we want
Titanic_bag
##
## Bagging classification trees with 300 bootstrap replications
##
## Call: bagging.data.frame(formula = Survived ~ ., data = Titanic_train,
## nbagg = 300)
Titanic_bag_oob <- bagging(formula = Survived~.,
data = Titanic_train,
coob = T,
nbagg = 300) #coob = T requests the out-of-bag estimate of the prediction error; for classification trees this is the misclassification rate, not the mean squared error
Titanic_bag_oob
##
## Bagging classification trees with 300 bootstrap replications
##
## Call: bagging.data.frame(formula = Survived ~ ., data = Titanic_train,
## coob = T, nbagg = 300)
##
## Out-of-bag estimate of misclassification error: 0.208
The out-of-bag estimate of misclassification error was found to be around 0.208.
Titanic_bag_pred <- predict(Titanic_bag, newdata = Titanic_test)
Titanic_bag_pred
## [1] Died Lived Lived Lived Lived Lived Lived Died Died Lived Died Lived
## [13] Lived Lived Lived Died Lived Lived Lived Lived Lived Lived Lived Lived
## [25] Lived Lived Lived Lived Lived Lived Died Died Died Died Lived Lived
## [37] Died Died Lived Died Died Died Died Died Lived Lived Died Died
## [49] Died Lived Died Died Lived Lived Lived Lived Lived Died Lived Lived
## [61] Lived Died Lived Died Lived Died Died Died Died Lived Died Lived
## [73] Died Died Died Died Died Lived Lived Died Died Died Died Lived
## [85] Died Died Died Died Died Lived Lived Died Died Died Died Died
## [97] Died Lived Lived Died Died Died Died Lived Died Died Died Lived
## [109] Died Died Lived Died Died Died Died Died Died Lived Lived Died
## [121] Died Died Died Died Died Died Died Died Died Died Died
## Levels: Died Lived
table(Titanic_bag_pred)
## Titanic_bag_pred
## Died Lived
## 76 55
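Beyond the raw counts of predicted classes, the bagging predictions can be checked against the actual outcomes in the test set (a minimal sketch, assuming Titanic_test retains the Survived column).
# Test-set confusion matrix and overall accuracy for the bagging model
bag_confusion <- table(Predicted = Titanic_bag_pred, Actual = Titanic_test$Survived)
bag_confusion
sum(diag(bag_confusion)) / sum(bag_confusion)  # accuracy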
#packages and libraries needed
#install.packages('randomForest')
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
Titanic_rf <- randomForest(as.factor(Survived)~.,
data = Titanic_train,
importance = TRUE, ntree = 300)
Titanic_rf
##
## Call:
## randomForest(formula = as.factor(Survived) ~ ., data = Titanic_train, importance = TRUE, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.95%
## Confusion matrix:
## Died Lived class.error
## Died 661 65 0.08953168
## Lived 170 282 0.37610619
The out-of-bag error for this model is around 0.1995.
Based on the out-of-bag errors, the random forest model (0.1995) performs slightly better than the bagging model (0.208). However, the two models return approximately the same out-of-bag error, so it is more accurate to say that they perform with about the same accuracy; because the training and testing datasets are randomized on each run, the ranking can change from run to run.
Treating "Lived" as the positive class, the false positive rate is 0.0895 (65 of the 726 passengers who died were predicted to live).
The false negative rate is 0.3761 (170 of the 452 passengers who lived were predicted to die).
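These rates can also be computed from the stored OOB confusion matrix, treating "Lived" as the positive class (a minimal sketch).
# Rows of Titanic_rf$confusion are actual classes, columns are predicted
# classes; drop the trailing class.error column before computing the rates
cm <- Titanic_rf$confusion[, c("Died", "Lived")]
fpr <- cm["Died", "Lived"] / sum(cm["Died", ])   # actual Died, predicted Lived
fnr <- cm["Lived", "Died"] / sum(cm["Lived", ])  # actual Lived, predicted Died
c(FPR = fpr, FNR = fnr)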
# Plot the OOB and per-class error curves for the random forest, with line width 2 for all three lines
plot(Titanic_rf, lwd=rep(2, 3))
# Add a legend at the right whose line widths, types, and colors match the plotted curves
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))
Titanic_rf_pred <- predict(Titanic_rf, newdata = Titanic_test)
Titanic_rf_pred
## 6 12 13 14 18 23 50 59 61 62 97 102 112
## Died Lived Lived Lived Lived Died Died Died Died Lived Died Died Lived
## 122 127 133 134 147 150 170 177 207 211 231 239 254
## Lived Died Died Died Lived Lived Lived Lived Lived Died Lived Lived Lived
## 274 279 283 290 314 318 319 352 362 370 386 390 402
## Died Died Lived Lived Died Died Died Died Lived Lived Died Died Lived
## 405 422 423 424 449 476 483 492 493 498 503 528 533
## Died Died Died Died Died Lived Lived Died Died Died Lived Died Died
## 535 537 549 552 558 575 587 600 602 630 661 666 668
## Lived Lived Lived Lived Lived Died Lived Lived Died Died Lived Died Lived
## 670 674 678 686 693 694 703 712 714 719 732 750 765
## Died Died Died Died Lived Died Lived Died Died Died Died Died Died
## 766 770 774 775 776 781 798 813 815 817 827 841 873
## Lived Died Died Died Died Died Died Died Died Died Died Died Died
## 903 908 921 922 931 932 940 947 952 958 975 1011 1015
## Died Died Died Died Died Died Lived Lived Died Died Died Died Died
## 1021 1028 1031 1041 1044 1045 1051 1065 1072 1081 1086 1096 1107
## Died Died Died Lived Died Lived Lived Died Died Died Died Lived Died
## 1114 1123 1143 1158 1171 1194 1205 1209 1210 1252 1258 1271 1283
## Lived Lived Died Died Died Died Died Died Died Died Died Died Died
## 1299
## Died
## Levels: Died Lived
table(Titanic_rf_pred)
## Titanic_rf_pred
## Died Lived
## 88 43
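Finally, because the forest was fit with importance = TRUE, the stored variable importance measures show which predictors drive the survival predictions (a minimal sketch).
# Variable importance: mean decrease in accuracy and in Gini impurity
importance(Titanic_rf)
varImpPlot(Titanic_rf)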