# Part 1: Association Analysis

Use R Markdown and the TransFood excel file to answer all of the questions in Part 1 of this homework.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library("readxl")
Food <- read_excel("/Users/jusimioni/Desktop/TransFood.xlsx")

Food <- as(as.matrix(Food), "transactions") # coerces the data into transactions for association rule mining
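A quick way to sanity-check the coercion is to look at the first few transactions (a minimal sketch using arules' inspect):

inspect(Food[1:3])  # list the items in the first three transactions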
  1. Load the TransFood excel file into R as a dataset of transactions. Each row in the dataset represents items purchased by a particular customer during one visit to a store. What is the largest total number of items purchased by a customer during one visit (i.e., what is the largest number of 1’s in any single row)? (Hint: You can answer this question by looking at a summary of the dataset.)
summary(Food)
## transactions as itemMatrix in sparse format with
##  19076 rows (elements/itemsets/transactions) and
##  118 columns (items) and a density of 0.02230729 
## 
## most frequent items:
##   Bottled.WaterFood Slice.of.CheeseFood    Medium.DrinkFood     Small.DrinkFood 
##                3166                3072                2871                2769 
##   Slice.of.PeppFood             (Other) 
##                2354               35981 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
##  197 5675 5178 3253 2129 1293  655  351  178   95   42   14    8    7    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.632   4.000  15.000 
## 
## includes extended item information - examples:
##              labels
## 1    Add.CheeseFood
## 2          BeerFood
## 3 Bottled.WaterFood
Based on the length distribution in the summary, the largest number of items purchased by a customer during one visit is 15.

  2. Which item is most frequently bought by customers? Create a frequency plot to support your answer.
itemFrequencyPlot(Food, support=0.10, cex.names=0.8)

Looking at the items with support of at least 10%, Bottled.WaterFood is the most frequently purchased item, appearing in 3166 transactions.
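The plot can be confirmed numerically by sorting the absolute item frequencies (a short sketch using arules' itemFrequency):

top_items <- sort(itemFrequency(Food, type = "absolute"), decreasing = TRUE)  # purchase count for every item
head(top_items, 5)  # the five most frequently purchased items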

  3. Perform association analysis to determine rules with support of at least 1% and confidence of at least 50%.
food_rules <- apriori(Food, parameter = list(sup = 0.01, conf = 0.5, target = "rules")) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 190 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.00s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(food_rules)
## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.375   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01012   Min.   :0.5152   Min.   :0.01735   Min.   : 3.199  
##  1st Qu.:0.01290   1st Qu.:0.5516   1st Qu.:0.02115   1st Qu.: 3.534  
##  Median :0.01594   Median :0.6398   Median :0.02432   Median : 7.687  
##  Mean   :0.01641   Mean   :0.6790   Mean   :0.02387   Mean   : 6.965  
##  3rd Qu.:0.01776   3rd Qu.:0.7672   3rd Qu.:0.02696   3rd Qu.: 8.331  
##  Max.   :0.02857   Max.   :0.9982   Max.   :0.02862   Max.   :13.181  
##      count      
##  Min.   :193.0  
##  1st Qu.:246.0  
##  Median :304.0  
##  Mean   :313.1  
##  3rd Qu.:338.8  
##  Max.   :545.0  
## 
## mining info:
##  data ntransactions support confidence
##  Food         19076    0.01        0.5
##                                                                              call
##  apriori(data = Food, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
a) How many association rules can you find that have support of at least 1% and confidence of at least 50%?

Eight rules were generated.
b) What is the smallest value of lift obtained for the association rules calculated in this question?

The minimum lift was 3.199.
c) Based on the values of lift for each rule, are the items in each rule frequently purchased together? Or are the items seldom purchased together?
inspect(head(food_rules))
##     lhs                                    rhs                             support confidence   coverage      lift count
## [1] {Hot.Chocolate.Souvenir.RefillFood} => {Hot.Chocolate.SouvenirFood} 0.01499266  0.5596869 0.02678759 13.180972   286
## [2] {ToppingFood}                       => {Ice.Cream.ConeFood}         0.02856993  0.9981685 0.02862235  8.947868   545
## [3] {Add.CheeseFood}                    => {Soft.Pretzel..3_39Food}     0.01913399  0.6965649 0.02746907  7.601643   365
## [4] {Chicken.TendersFood}               => {French.Fries.BasketFood}    0.01729922  0.7586207 0.02280352  7.771992   330
## [5] {CheeseburgerFood}                  => {French.Fries.BasketFood}    0.01687985  0.7931034 0.02128329  8.125264   322
## [6] {GatoradeFood,                                                                                                      
##      Slice.of.PeppFood}                 => {Slice.of.CheeseFood}        0.01011743  0.5830816 0.01735165  3.620724   193

The items in each rule are frequently purchased together. For example, chicken tenders are bought with a French fries basket, and a hot chocolate souvenir refill is bought with the hot chocolate souvenir. Since every rule has a lift well above 1 (the minimum is 3.199), the item sets on each side of a rule appear together far more often than they would if purchases were independent.
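Sorting the rules by lift puts the strongest associations first (a one-line sketch using arules' sort method for rules):

inspect(sort(food_rules, by = "lift"))  # rules ordered from highest to lowest lift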
d) What is the average confidence obtained for the association rules calculated in this question?

The average confidence was 0.6790.

e) What is the median support obtained for the association rules calculated in this question?

The median support was 0.01594.
f) Create a plot to visualize the support, confidence, and lift of the rules calculated in this question.

library('arulesViz')
plot(food_rules)
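Other arulesViz displays can present the same three measures differently; for example, a network view (a sketch, assuming a reasonably recent arulesViz version):

plot(food_rules, method = "graph")  # items and rules drawn as a network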


  4. Choose two rules, and explain how the store could use these rules to improve sales.

For example, rule [6] {GatoradeFood, Slice.of.PeppFood} => {Slice.of.CheeseFood} says that a customer who buys a Gatorade and a slice of pepperoni pizza usually also buys a slice of cheese pizza. Offering a combo of one drink and two slices may increase sales in cases where a customer would otherwise buy only a drink and one slice, or only two slices: the customer would then purchase three items instead of two. Another example is rule [3] {Add.CheeseFood} => {Soft.Pretzel..3_39Food}, where customers who add cheese usually buy a soft pretzel. A promotion bundling these two items, or the two items plus a drink, could likewise increase the number of items per purchase and therefore total sales.
# Part 2: Bagging and Random Forests

Use R Markdown and the Titanic passenger list excel file to answer all of the questions in Part 2 of this homework. This dataset contains information about all passengers (excluding crew members) on the Titanic.
1. Load the Titanic passenger list excel file into R, convert the dataset into a data frame, and convert all non-numeric columns of data into factors.

library('xlsx')
titanic <- read.xlsx("/Users/jusimioni/Desktop/Titanic passenger list.xlsx", sheetIndex = 1)
titanic <- as.data.frame(titanic)
colnames(titanic)
## [1] "Passenger_Class"   "Survived"          "Sex"              
## [4] "Sibling_or_Spouse" "Parent_or_Child"   "Fare"             
## [7] "Embarked"

Survived

titanic$Survived <- as.factor(titanic$Survived)
levels(titanic$Survived)
## [1] "Died"  "Lived"
titanic$Survived <- as.numeric(titanic$Survived)  # factor levels are coded alphabetically: Died = 1, Lived = 2

Sex

titanic$Sex <- as.factor(titanic$Sex)
levels(titanic$Sex)
## [1] "female" "male"
titanic$Sex <- as.numeric(titanic$Sex)  # female = 1, male = 2

Embarked

titanic$Embarked <- as.factor(titanic$Embarked)
levels(titanic$Embarked)
## [1] "Cherbourg"    "Queenstown"   "Southhampton"
titanic$Embarked <- as.numeric(titanic$Embarked)  # Cherbourg = 1, Queenstown = 2, Southhampton = 3
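Equivalently, all character columns could be converted to factors in one pass before the numeric recoding (a minimal sketch, assuming the non-numeric columns are stored as character):

char_cols <- sapply(titanic, is.character)  # flag the non-numeric columns
titanic[char_cols] <- lapply(titanic[char_cols], as.factor)  # convert each flagged column to a factor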
2. Check for missing values in the dataset.
sum(is.na(titanic))
## [1] 3
sum(is.na(titanic$Survived))
## [1] 0
sum(is.na(titanic$Sex))
## [1] 0
sum(is.na(titanic$Sibling_or_Spouse))
## [1] 0
sum(is.na(titanic$Parent_or_Child))
## [1] 0
sum(is.na(titanic$Fare))
## [1] 1
sum(is.na(titanic$Embarked))
## [1] 2

There is one missing value for Fare and two missing values for Embarked.
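The column-by-column counts above can also be produced with a single call:

colSums(is.na(titanic))  # number of missing values in every column at once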

a) Create a table and/or graph displaying the number of passengers who embarked from each location (i.e., the number of passengers who embarked from Cherbourg, the number who embarked from Queenstown, etc.). From which location did the largest number of passengers embark? Impute all missing values in the Embarked column with this location.
hist(titanic$Embarked)

The largest number of passengers embarked from location 3, which corresponds to Southampton.
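A frequency table gives the same counts directly (the numeric codes follow the factor levels from earlier):

table(titanic$Embarked)  # 1 = Cherbourg, 2 = Queenstown, 3 = Southhampton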

titanic$Embarked[is.na(titanic$Embarked)] <- 3  # impute Southampton (code 3), the most common embarkation point
b) For any missing value(s) in a numeric column(s) of data, impute the average of the column.
titanic$Fare[is.na(titanic$Fare)] <- mean(titanic$Fare, na.rm = TRUE)  # replace the single missing fare with the column mean
3. Randomly sample the rows of your data to include 90% of the rows in your training set. Use the rest of the rows as your testing set.
index <- sample(nrow(titanic), nrow(titanic) * 0.90)  # randomly choose 90% of the row indices
titanic_train <- titanic[index,]  # training set
titanic_test <- titanic[-index,]  # testing set
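Because the split is random, the numbers below will vary from run to run. Fixing a seed before sampling (the value 42 here is arbitrary) would make the split reproducible:

set.seed(42)  # any fixed seed makes the 90/10 split reproducible
index <- sample(nrow(titanic), nrow(titanic) * 0.90)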
4. Using 300 bootstrapped sets, develop a bagging model on your training set to predict whether or not a passenger survived the sinking of the Titanic.
library(ipred)
titanic_bag <- bagging(formula = Survived ~ ., data = titanic_train, coob = TRUE, nbagg = 300)  # coob = TRUE requests the out-of-bag error estimate

titanic_bag
## 
## Bagging regression trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = titanic_train, 
##     coob = T, nbagg = 300)
## 
## Out-of-bag estimate of root mean squared error:  0.3798
a) What is the out-of-bag error for your model?

The out-of-bag estimate of the root mean squared error is 0.3798.
b) Use your model to make predictions for the observations in your testing set.
titanic_tree_pred <- predict(titanic_bag, newdata = titanic_test)  # bagging predictions for the testing set
mean((titanic_test$Survived - titanic_tree_pred)^2)  # test-set mean squared error
## [1] 0.1517067

The test-set MSE for the bagging model is 0.1517.
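Since Survived is coded numerically (1 = Died, 2 = Lived), the continuous bagging predictions can also be turned into class labels by splitting at the midpoint of the two codes (a sketch; the 1.5 cutoff is an assumption, not part of the assignment):

pred_class <- ifelse(titanic_tree_pred >= 1.5, 2, 1)  # assign each prediction to the nearer code
mean(pred_class == titanic_test$Survived)  # proportion of test passengers classified correctly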
The test MSE can also be tracked as the number of bootstrapped sets grows:

ntree <- c(1, 3, 5, seq(20, 300, 18))  # candidate numbers of bootstrap replications
MSE_test <- rep(0, length(ntree))
for(i in 1:length(ntree)){
  titanic_bag1 <- bagging(Survived~., data = titanic_train, nbagg = ntree[i])
  titanic_bag_pred1 <- predict(titanic_bag1, newdata = titanic_test) 
  MSE_test[i] <- mean((titanic_test$Survived - titanic_bag_pred1)^2) 
}
plot(ntree, MSE_test, type = 'l', col = 2, lwd = 2, xaxt = "n")
axis(1, at = ntree, las = 1) 

5. Using 300 trees, develop a random forest model on your training set to predict whether or not a passenger survived the sinking of the Titanic.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
titanic_rf <- randomForest(Survived~., data = titanic_train, importance = TRUE, ntree = 300) 
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
titanic_rf # mean of square residuals in the output is the out-of-bag mean squared error
## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_train, importance = TRUE,      ntree = 300) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.1397338
##                     % Var explained: 40.91
a) What is the out-of-bag error for your model?

The out-of-bag error for the random forest is the mean of squared residuals, which here equals 0.1397.

b) Based on the out-of-bag error, is the random forest better at predicting survival than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy?

The random forest has a slightly smaller out-of-bag error than the bagging model. Note that the bagging output reports a root mean squared error, so the comparable MSE is 0.3798^2 ≈ 0.144, versus 0.1397 for the random forest. The random forest is therefore somewhat better, although it still explains only about 41% of the variance in survival.

c) What is the false positive rate (for your random forest model developed with 300 trees)?

titanic_rf2 <- randomForest(as.factor(Survived) ~ .,  # refit as a classification forest so error rates are reported
                          data = titanic_train,
                          importance = TRUE, ntree = 300)
titanic_rf2
## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ ., data = titanic_train,      importance = TRUE, ntree = 300) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.71%
## Confusion matrix:
##     1   2 class.error
## 1 660  66  0.09090909
## 2 178 274  0.39380531
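In the confusion matrix, rows are actual classes and columns are predicted classes (class.error is the per-class error rate). Treating class 2 (Lived) as the positive class, the rates can be computed directly from the fitted object:

conf <- titanic_rf2$confusion[, 1:2]  # drop the class.error column
fpr <- conf["1", "2"] / sum(conf["1", ])  # Died passengers predicted as Lived
fnr <- conf["2", "1"] / sum(conf["2", ])  # Lived passengers predicted as Died
c(FPR = fpr, FNR = fnr)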

Treating class 2 (Lived) as the positive class, the false positives are the 66 passengers who died but were predicted to have survived, giving a false positive rate of 66 / (660 + 66) ≈ 0.0909, or 9.09% (the class 1 error in the output above).

d) What is the false negative rate (for your random forest model developed with 300 trees)?

The false negatives are the 178 passengers who survived but were predicted to have died, giving a false negative rate of 178 / (178 + 274) ≈ 0.3938, or 39.38% (the class 2 error).

e) Create a plot comparing the out-of-bag error, the false positive rate, the false negative rate, and the number of trees. Based on your plot, does it appear that a large number of trees are needed to develop a fairly accurate random forest model?
plot(titanic_rf2, lwd = rep(2, 3))  # classification forest: OOB error plus the two per-class error curves
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))

f) Use your random forest model to make predictions for the observations in your testing set.

titanic_pred <- predict(titanic_rf, newdata = titanic_test)  # regression-forest predictions for the testing set
mean((titanic_test$Survived - titanic_pred)^2)  # test-set mean squared error
## [1] 0.1444436

The random forest test-set MSE is 0.1444, slightly smaller than the bagging model's test MSE of 0.1517, consistent with the out-of-bag comparison above.
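Since the forest was fit with importance = TRUE, the relative importance of each predictor could also be examined (a short sketch using randomForest's varImpPlot; not part of the assignment questions):

varImpPlot(titanic_rf, main = "Variable importance")  # %IncMSE and IncNodePurity for each predictor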