# Part 1: Association Analysis

Use R Markdown and the TransFood Excel file to answer all of the questions in Part 1 of this homework.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library("readxl")
Food <- read_excel("/Users/jusimioni/Desktop/TransFood.xlsx")
Food <- as(as.matrix(Food), "transactions") # coerce the data into transactions for association rule mining
summary(Food)
## transactions as itemMatrix in sparse format with
## 19076 rows (elements/itemsets/transactions) and
## 118 columns (items) and a density of 0.02230729
##
## most frequent items:
## Bottled.WaterFood Slice.of.CheeseFood Medium.DrinkFood Small.DrinkFood
## 3166 3072 2871 2769
## Slice.of.PeppFood (Other)
## 2354 35981
##
## element (itemset/transaction) length distribution:
## sizes
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15
## 197 5675 5178 3253 2129 1293 655 351 178 95 42 14 8 7 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.632 4.000 15.000
##
## includes extended item information - examples:
## labels
## 1 Add.CheeseFood
## 2 BeerFood
## 3 Bottled.WaterFood
itemFrequencyPlot(Food, support=0.10, cex.names=0.8)
The plot shows the items with a support of at least 10%. Bottled.WaterFood is the most frequently purchased item, appearing in 3166 transactions.
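To check the counts behind the plot, the absolute item frequencies can be pulled straight from the transactions object; a minimal sketch using arules' itemFrequency():
# top items by absolute frequency (counts); Bottled.WaterFood should lead with 3166
head(sort(itemFrequency(Food, type = "absolute"), decreasing = TRUE), 5)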
food_rules <- apriori(Food, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 190
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.00s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(food_rules)
## set of 8 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 5 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.375 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01012 Min. :0.5152 Min. :0.01735 Min. : 3.199
## 1st Qu.:0.01290 1st Qu.:0.5516 1st Qu.:0.02115 1st Qu.: 3.534
## Median :0.01594 Median :0.6398 Median :0.02432 Median : 7.687
## Mean :0.01641 Mean :0.6790 Mean :0.02387 Mean : 6.965
## 3rd Qu.:0.01776 3rd Qu.:0.7672 3rd Qu.:0.02696 3rd Qu.: 8.331
## Max. :0.02857 Max. :0.9982 Max. :0.02862 Max. :13.181
## count
## Min. :193.0
## 1st Qu.:246.0
## Median :304.0
## Mean :313.1
## 3rd Qu.:338.8
## Max. :545.0
##
## mining info:
## data ntransactions support confidence
## Food 19076 0.01 0.5
## call
## apriori(data = Food, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
inspect(head(food_rules))
## lhs rhs support confidence coverage lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266 0.5596869 0.02678759 13.180972 286
## [2] {ToppingFood} => {Ice.Cream.ConeFood} 0.02856993 0.9981685 0.02862235 8.947868 545
## [3] {Add.CheeseFood} => {Soft.Pretzel..3_39Food} 0.01913399 0.6965649 0.02746907 7.601643 365
## [4] {Chicken.TendersFood} => {French.Fries.BasketFood} 0.01729922 0.7586207 0.02280352 7.771992 330
## [5] {CheeseburgerFood} => {French.Fries.BasketFood} 0.01687985 0.7931034 0.02128329 8.125264 322
## [6] {GatoradeFood,
## Slice.of.PeppFood} => {Slice.of.CheeseFood} 0.01011743 0.5830816 0.01735165 3.620724 193
These rules describe items that are frequently bought together: for example, Chicken Tenders with the French Fries Basket, and the Hot Chocolate Souvenir Refill with the Hot Chocolate Souvenir cup. Since every rule has a lift well above 1, the items in each rule appear together far more often than would be expected if they were purchased independently.

d) What is the average confidence obtained for the association rules calculated in this question?

The average confidence was 0.6790.

e) What is the median support obtained for the association rules calculated in this question?

The median support was 0.01594.
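Both values can also be read programmatically from the rule set with arules' quality() accessor instead of off the summary() printout; a small sketch:
rule_quality <- quality(food_rules)   # data frame of support, confidence, coverage, lift, count
mean(rule_quality$confidence)         # average confidence, ~0.679
median(rule_quality$support)          # median support, ~0.0159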
f) Create a plot to visualize the support, confidence, and lift of the
rules calculated in this question.
library('arulesViz')
plot(food_rules)
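To see which individual rules sit in the high-lift region of the scatter plot, the rule set can be sorted by lift and inspected (standard arules calls):
# rules ordered from highest to lowest lift; the top rule is the hot chocolate refill/souvenir pairing
inspect(sort(food_rules, by = "lift"))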
# Part 2

Use R Markdown and the Titanic passenger list Excel file to answer all of the questions in Part 2 of this homework. This dataset contains information about all passengers (excluding crew members) on the Titanic.
1. Load the Titanic passenger list excel file into R, convert the
dataset into a data frame, and convert all non-numeric columns of data
into factors.
library('xlsx')
titanic <- read.xlsx("/Users/jusimioni/Desktop/Titanic passenger list.xlsx", sheetIndex = 1)
titanic <- as.data.frame(titanic)
colnames(titanic)
## [1] "Passenger_Class" "Survived" "Sex"
## [4] "Sibling_or_Spouse" "Parent_or_Child" "Fare"
## [7] "Embarked"
Survived
titanic$Survived = as.factor(titanic$Survived)
levels(titanic$Survived)
## [1] "Died" "Lived"
titanic$Survived <- as.numeric(titanic$Survived) # factor levels become 1 = "Died", 2 = "Lived"
Sex
titanic$Sex = as.factor(titanic$Sex)
levels(titanic$Sex)
## [1] "female" "male"
titanic$Sex <- as.numeric(titanic$Sex) # factor levels become 1 = "female", 2 = "male"
Embarked
titanic$Embarked = as.factor(titanic$Embarked)
levels(titanic$Embarked)
## [1] "Cherbourg" "Queenstown" "Southhampton"
titanic$Embarked <- as.numeric(titanic$Embarked) # factor levels become 1 = "Cherbourg", 2 = "Queenstown", 3 = "Southhampton"
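Note that as.numeric() on a factor simply returns each value's level index (in alphabetical level order); named arguments passed to it do not control the coding. If a particular coding is required, the mapping has to be made explicit. A toy sketch (not run on the titanic columns):
x <- factor(c("Lived", "Died", "Died"))   # toy factor, not the real column
codes <- c(Died = 0, Lived = 1)           # choose the coding explicitly
as.numeric(codes[as.character(x)])        # 1 0 0, independent of level order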
sum(is.na(titanic))
## [1] 3
sum(is.na(titanic$Survived))
## [1] 0
sum(is.na(titanic$Sex))
## [1] 0
sum(is.na(titanic$Sibling_or_Spouse))
## [1] 0
sum(is.na(titanic$Parent_or_Child))
## [1] 0
sum(is.na(titanic$Fare))
## [1] 1
sum(is.na(titanic$Embarked))
## [1] 2
There is one missing value for Fare and two missing values for Embarked.
hist(titanic$Embarked)
The largest number of passengers embarked at port 3, Southampton, so that value is used to impute the missing Embarked entries below.
titanic$Embarked[is.na(titanic$Embarked)] <- 3 # impute missing Embarked with the most common port (Southampton)
titanic$Fare[is.na(titanic$Fare)] <- mean(titanic$Fare, na.rm = TRUE) # impute missing Fare with the mean fare
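A quick check that the imputation left no missing values (base R only):
colSums(is.na(titanic))   # every column should now report 0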
index <- sample(nrow(titanic), nrow(titanic)*0.90) # random 90/10 train/test split
titanic_train <- titanic[index,]
titanic_test <- titanic[-index,]
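Because sample() is random, this split, and every error estimate that follows, changes from run to run; in the actual workflow a seed would be set before the sample() call above to make the knitted numbers reproducible. A minimal sketch (the seed value is arbitrary):
set.seed(2024)   # arbitrary seed; called before sample() it makes the 90/10 split reproducible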
library(ipred)
titanic_bag <- bagging(formula = Survived~., data = titanic_train,coob =T, nbagg = 300)
titanic_bag
##
## Bagging regression trees with 300 bootstrap replications
##
## Call: bagging.data.frame(formula = Survived ~ ., data = titanic_train,
## coob = T, nbagg = 300)
##
## Out-of-bag estimate of root mean squared error: 0.3798
titanic_tree_pred <- predict(titanic_bag, newdata = titanic_test)
mean((titanic_test$Survived - titanic_tree_pred)^2)
## [1] 0.1517067
ntree <- c(1, 3, 5, seq(20, 300, 18)) # candidate numbers of bagged trees
MSE_test <- rep(0, length(ntree))
for (i in 1:length(ntree)) {
  # refit the bagging model with ntree[i] trees and record its test-set MSE
  titanic_bag1 <- bagging(Survived ~ ., data = titanic_train, nbagg = ntree[i])
  titanic_bag_pred1 <- predict(titanic_bag1, newdata = titanic_test)
  MSE_test[i] <- mean((titanic_test$Survived - titanic_bag_pred1)^2)
}
plot(ntree, MSE_test, type = 'l', col = 2, lwd = 2, xaxt = "n")
axis(1, at = ntree, las = 1)
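Rather than reading the best tree count off the plot, it can be pulled from the stored test errors; a small sketch:
ntree[which.min(MSE_test)]   # number of bagged trees with the lowest test-set MSE in this run
min(MSE_test)                # the corresponding test-set MSE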
5. Using 300 trees, develop a random forest model on your training set
to predict whether or not a passenger survived the sinking of the
Titanic.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
titanic_rf <- randomForest(Survived~., data = titanic_train, importance = TRUE, ntree = 300)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
titanic_rf # the "Mean of squared residuals" in the output is the out-of-bag mean squared error
##
## Call:
## randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE, ntree = 300)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1397338
## % Var explained: 40.91
What is the out-of-bag error for your model?

The out-of-bag error for the random forest is the mean of squared residuals reported above, which in this case equals 0.1397.
Based on the out-of-bag error, is the random forest better at predicting survival than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy?

The two out-of-bag figures are reported on different scales: ipred's bagging() reports a root mean squared error (0.3798, which corresponds to an MSE of about 0.144), while randomForest() reports the MSE directly (0.1397). On the same MSE scale the random forest's out-of-bag error is slightly smaller, so it performs marginally better, although the two models are close. Note also that the random forest explains only about 41% of the variance in the response (% Var explained: 40.91).
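A small sketch of that comparison on a common scale (the 0.3798 RMSE is copied from the bagging output above; a regression randomForest stores its running out-of-bag MSE in the $mse component):
bagging_oob_mse <- 0.3798^2            # ipred reports RMSE, so square it: ~0.144
rf_oob_mse <- tail(titanic_rf$mse, 1)  # OOB MSE after all 300 trees: ~0.140
c(bagging = bagging_oob_mse, random_forest = rf_oob_mse)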
What is the false positive rate (for your random forest model developed with 300 trees)?
titanic_rf2 <- randomForest(as.factor(Survived)~.,
data = titanic_train,
importance = TRUE, ntree = 300)
titanic_rf2
##
## Call:
## randomForest(formula = as.factor(Survived) ~ ., data = titanic_train, importance = TRUE, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.71%
## Confusion matrix:
## 1 2 class.error
## 1 660 66 0.09090909
## 2 178 274 0.39380531
Treating class 2 ("Lived") as the positive class, the model produced 66 false positives (passengers who died but were predicted to survive), so the false positive rate is 66/726 = 0.0909, or about 9.1%, matching the class.error reported for class 1 in the confusion matrix above.
d) What is the false negative rate (for your random forest model developed with 300 trees)?

The model produced 178 false negatives (passengers who survived but were predicted to die), so the false negative rate is 178/452 = 0.3938, or about 39.4%, matching the class.error reported for class 2.
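Both rates can be computed directly from the out-of-bag confusion matrix stored in the fitted model, under the same convention that class 2 ("Lived") is the positive class; a small sketch:
cm <- titanic_rf2$confusion[, 1:2]    # drop the class.error column
fpr <- cm["1", "2"] / sum(cm["1", ])  # died but predicted to survive: 66 / 726
fnr <- cm["2", "1"] / sum(cm["2", ])  # survived but predicted to die: 178 / 452
c(FPR = fpr, FNR = fnr)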
plot(titanic_rf2, lwd = rep(2, 3)) # error-rate curves for the classification forest: OOB, class 1 (died), class 2 (lived)
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))
f) Use your random forest model to make predictions for the observations in your testing set.

The random forest's test-set MSE, computed below, is 0.1444, slightly lower than the bagging model's test-set MSE of 0.1517 obtained earlier.
titanic_pred <- predict(titanic_rf, newdata = titanic_test)
mean((titanic_test$Survived - titanic_pred)^2)
## [1] 0.1444436
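For a side-by-side view, the two test-set errors can be printed together (reusing the prediction vectors created above):
# test-set MSEs of the bagging and random forest models
c(bagging = mean((titanic_test$Survived - titanic_tree_pred)^2),
  random_forest = mean((titanic_test$Survived - titanic_pred)^2))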