Part 1

Question 1

Load the TransFood excel file into R as a dataset of transactions. Each row in the dataset represents items purchased by a particular customer during one visit to a store. What is the largest total number of items purchased by a customer during one visit (i.e., what is the largest number of 1’s in any single row)?

library("readxl")
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
TransFood <- read_excel("G:/Other computers/My Laptop/Documents/Richard 621/Week 9/TransFood.xlsx")
TransFood <- as(as.matrix(TransFood), "transactions")
summary(TransFood)
## transactions as itemMatrix in sparse format with
##  19076 rows (elements/itemsets/transactions) and
##  118 columns (items) and a density of 0.02230729 
## 
## most frequent items:
##   Bottled.WaterFood Slice.of.CheeseFood    Medium.DrinkFood     Small.DrinkFood 
##                3166                3072                2871                2769 
##   Slice.of.PeppFood             (Other) 
##                2354               35981 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
##  197 5675 5178 3253 2129 1293  655  351  178   95   42   14    8    7    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.632   4.000  15.000 
## 
## includes extended item information - examples:
##              labels
## 1    Add.CheeseFood
## 2          BeerFood
## 3 Bottled.WaterFood
# The largest total number of items purchased by one customer in a single
# visit was 15 (the maximum of the "sizes" distribution above)
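The same answer can be pulled programmatically; a minimal sketch using arules' size(), which returns the item count of each transaction:

max(size(TransFood))  # should return 15, matching the sizes table above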

Question 2

Which item is most frequently bought by customers? Create a frequency plot to support your answer.

itemFrequencyPlot(TransFood, support = 0.1, cex.names = 0.8)

# Bottled water (Bottled.WaterFood) is the most frequently purchased item, with 3166 transactions
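To verify the plot's answer numerically, a sketch using arules' itemFrequency() with absolute counts:

head(sort(itemFrequency(TransFood, type = "absolute"), decreasing = TRUE), 1)  # should name Bottled.WaterFood (3166)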

Question 3

Perform association analysis to determine rules with support of at least 1% and confidence of at least 50%.

TFoods_rules <- apriori(TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules")) # find all rules with a minimum support of 1% and a minimum confidence of 50%; maxlen defaults to 10 (no more than 10 items in a rule)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 190 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 19076 transaction(s)] done [0.00s].
## sorting and recoding items ... [45 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(TFoods_rules)
## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.375   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01012   Min.   :0.5152   Min.   :0.01735   Min.   : 3.199  
##  1st Qu.:0.01290   1st Qu.:0.5516   1st Qu.:0.02115   1st Qu.: 3.534  
##  Median :0.01594   Median :0.6398   Median :0.02432   Median : 7.687  
##  Mean   :0.01641   Mean   :0.6790   Mean   :0.02387   Mean   : 6.965  
##  3rd Qu.:0.01776   3rd Qu.:0.7672   3rd Qu.:0.02696   3rd Qu.: 8.331  
##  Max.   :0.02857   Max.   :0.9982   Max.   :0.02862   Max.   :13.181  
##      count      
##  Min.   :193.0  
##  1st Qu.:246.0  
##  Median :304.0  
##  Mean   :313.1  
##  3rd Qu.:338.8  
##  Max.   :545.0  
## 
## mining info:
##       data ntransactions support confidence
##  TransFood         19076    0.01        0.5
##                                                                                   call
##  apriori(data = TransFood, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))

a) How many association rules can you find that have support of at least 1% and confidence of at least 50%?

Eight rules (the summary reports a "set of 8 rules").

b) What is the smallest value of lift obtained for the association rules calculated in this question?

The smallest lift is 3.199 (the Min. value in the lift column of the summary).

c) Based on the values of lift for each rule, are the items in each rule frequently purchased together?

Yes. Every rule has lift well above 1 (the minimum is 3.199), so the items in each rule are purchased together far more often than would be expected if they were independent.

d) What is the average confidence obtained for the association rules calculated in this question?

The average (mean) confidence is 0.6790.

e) What is the median support obtained for the association rules calculated in this question?

The median support is 0.01594.

f) Create a plot to visualize the support, confidence, and lift of the rules calculated in this question.

#install.packages('arulesViz')
library('arulesViz')
## Warning: package 'arulesViz' was built under R version 4.2.2
#plot(TFoods_rules, interactive = TRUE)  # interactive plot; see the .R file for the rendered plot
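A static alternative for the knitted document: the default arulesViz scatter plot shows all three measures at once (support on the x-axis, confidence on the y-axis, points shaded by lift); a minimal sketch:

plot(TFoods_rules)  # default scatter: support vs. confidence, shaded by lift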

g) Choose two rules, and explain how the store could use these rules to improve sales.

Number of rules selected: 1

    lhs              rhs                    support    confidence coverage   lift     count order id
[1] {ToppingFood} => {Ice.Cream.ConeFood}   0.02856993 0.9981685  0.02862235 8.947868   545     2  2

Customers who buy toppings almost always buy an ice cream cone with them (lift = 8.95, confidence = 99.8%). As a store, I would place a dry-goods end cap stocked with ice cream toppings in the freezer section.

Number of rules selected: 1

    lhs                   rhs                        support    confidence coverage   lift     count order id
[1] {CheeseburgerFood} => {French.Fries.BasketFood}  0.01687985 0.7931034  0.02128329 8.125264   322     2  5

Customers tend to buy cheeseburgers and a basket of french fries together (lift = 8.13). As a store, I would run a special on burgers at the same time as a sale on cheese and french fries, and keep these items together in one condensed, easy-to-shop area of the store.
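The rules discussed above were picked from the interactive plot; they can also be listed without interaction, for example sorted by lift with arules' sort() and inspect():

inspect(sort(TFoods_rules, by = "lift"))  # all 8 rules, strongest associations first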

Part 2

Question 1

Load the Titanic passenger list excel file into R, convert the dataset into a data frame, and convert all non-numeric columns of data into factors.

# data file loaded and converted to a data frame
Titanic <- read_excel("G:/Other computers/My Laptop/Documents/Richard 621/Week 9/Titanic passenger list.xlsx")
Titanic<- as.data.frame(Titanic)
summary(Titanic)
##  Passenger_Class   Survived             Sex            Sibling_or_Spouse
##  Min.   :1.000   Length:1309        Length:1309        Min.   :0.0000   
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.:0.0000   
##  Median :3.000   Mode  :character   Mode  :character   Median :0.0000   
##  Mean   :2.295                                         Mean   :0.4989   
##  3rd Qu.:3.000                                         3rd Qu.:1.0000   
##  Max.   :3.000                                         Max.   :8.0000   
##                                                                         
##  Parent_or_Child      Fare           Embarked        
##  Min.   :0.000   Min.   :  0.000   Length:1309       
##  1st Qu.:0.000   1st Qu.:  7.896   Class :character  
##  Median :0.000   Median : 14.454   Mode  :character  
##  Mean   :0.385   Mean   : 33.295                     
##  3rd Qu.:0.000   3rd Qu.: 31.275                     
##  Max.   :9.000   Max.   :512.329                     
##                  NA's   :1
str(Titanic)
## 'data.frame':    1309 obs. of  7 variables:
##  $ Passenger_Class  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Survived         : chr  "Lived" "Lived" "Died" "Died" ...
##  $ Sex              : chr  "female" "male" "female" "male" ...
##  $ Sibling_or_Spouse: num  0 1 1 1 1 0 1 0 2 0 ...
##  $ Parent_or_Child  : num  0 2 2 2 2 0 0 0 0 0 ...
##  $ Fare             : num  211 152 152 152 152 ...
##  $ Embarked         : chr  "Southhampton" "Southhampton" "Southhampton" "Southhampton" ...
#changed character variables to factor variables
Titanic$Survived <- as.factor(Titanic$Survived)
Titanic$Sex <- as.factor(Titanic$Sex)
Titanic$Embarked <- as.factor(Titanic$Embarked)
str(Titanic)
## 'data.frame':    1309 obs. of  7 variables:
##  $ Passenger_Class  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Survived         : Factor w/ 2 levels "Died","Lived": 2 2 1 1 1 2 2 1 2 1 ...
##  $ Sex              : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ Sibling_or_Spouse: num  0 1 1 1 1 0 1 0 2 0 ...
##  $ Parent_or_Child  : num  0 2 2 2 2 0 0 0 0 0 ...
##  $ Fare             : num  211 152 152 152 152 ...
##  $ Embarked         : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 3 3 3 3 3 3 3 3 1 ...

Question 2

Check for missing values in the dataset

#Fare has one missing value and Embarked has two missing values
sum(is.na(Titanic))
## [1] 3
sapply(Titanic, function(x) sum(is.na(x)))
##   Passenger_Class          Survived               Sex Sibling_or_Spouse 
##                 0                 0                 0                 0 
##   Parent_or_Child              Fare          Embarked 
##                 0                 1                 2

a) Create a table and/or graph displaying the number of passengers who embarked from each location (i.e., the number of passengers who embarked from Cherbourg, the number who embarked from Queenstown, etc.). From which location did the largest number of passengers embark? Impute all missing values in the Embarked column with this location.

# barplot of the number of passengers who embarked from each location
barplot(table(Titanic$Embarked), col="green", ylab ="Number of Passengers", xlab ="Embarked Location")

# Southhampton has the largest number of passengers (over 800); the code below imputes the missing values with Southhampton
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
Titanic$Embarked <- with(Titanic, impute(Embarked, 'Southhampton'))
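As an aside, the same imputation can be done without Hmisc; a base-R sketch, run instead of the impute() call and keeping the data's own spelling of the level ("Southhampton"):

Titanic$Embarked[is.na(Titanic$Embarked)] <- "Southhampton"  # replace NAs with the modal port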

b) For any missing value(s) in a numeric column(s) of data, impute the average of the column.

# Fare had one missing value, so it was imputed with the column mean; the last line of code confirms no remaining missing values
Titanic$Fare[is.na(Titanic$Fare)] <- mean(Titanic$Fare, na.rm = T)
sum(is.na(Titanic)) # no more missing values in any column
## [1] 0

Question 3

Randomly sample the rows of your data to include 90% of the rows in your training set. Use the rest of the rows as your testing set.

# randomly sampled 90% of the rows for the training set; the remaining 10% form the testing set
index <- sample(nrow(Titanic), nrow(Titanic)*0.90)
Titanic_train <- Titanic[index,]
Titanic_test <- Titanic[-index,]
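Because sample() draws randomly, this split (and every result labeled "when first run" below) changes on each knit. A minimal sketch that makes the split reproducible, assuming an arbitrary seed value:

set.seed(621)  # hypothetical seed; any fixed value pins down the sample
index <- sample(nrow(Titanic), nrow(Titanic)*0.90)
Titanic_train <- Titanic[index,]
Titanic_test <- Titanic[-index,]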

Question 4

Using 300 bootstrapped sets, develop a bagging model on your training set to predict whether or not a passenger survived the sinking of the Titanic.

#install.packages('ipred')
library(ipred)
## Warning: package 'ipred' was built under R version 4.2.1
Titanic_bag <- bagging(formula = Survived~., data = Titanic_train, nbagg = 300)
Titanic_bag
## 
## Bagging classification trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = Titanic_train, 
##     nbagg = 300)

a) What is the out-of-bag error for your model?

Titanic_bag_oob <- bagging(formula = Survived~.,
                          data = Titanic_train,
                          coob = T,
                          nbagg = 300)
Titanic_bag_oob
## 
## Bagging classification trees with 300 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Survived ~ ., data = Titanic_train, 
##     coob = T, nbagg = 300)
## 
## Out-of-bag estimate of misclassification error:  0.208
# Out-of-bag estimate of misclassification error was 0.2037 when first run

b) Use your model to make predictions for the observations in your testing set.

Titanic_bag_pred <- predict(Titanic_bag, newdata = Titanic_test, type= "class")
table(Titanic_test$Survived, Titanic_bag_pred, dnn = c("Truth", "Predicted"))
##        Predicted
## Truth   Died Lived
##   Died    70     8
##   Lived   18    35
# Table obtained from the first run's sample:
#        Predicted
# Truth   Died Lived
#   Died    76    15
#   Lived   14    26
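The confusion matrix can be collapsed into a single test error number; a minimal sketch (with the knitted counts this gives (8 + 18)/131, about 0.198):

conf <- table(Titanic_test$Survived, Titanic_bag_pred)
1 - sum(diag(conf))/sum(conf)  # test-set misclassification rate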

Question 5

Using 300 trees, develop a random forest model on your training set to predict whether or not a passenger survived the sinking of the Titanic.

#300 trees used to make random forest model
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
Titanic_rf <- randomForest(Survived~., data = Titanic_train, importance = TRUE, ntree = 300)
Titanic_rf
## 
## Call:
##  randomForest(formula = Survived ~ ., data = Titanic_train, importance = TRUE,      ntree = 300) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.44%
## Confusion matrix:
##       Died Lived class.error
## Died   671    60  0.08207934
## Lived  169   278  0.37807606
# Output from the first run:
# Call:
#  randomForest(formula = Survived ~ ., data = Titanic_train, importance = TRUE, ntree = 300)
#                Type of random forest: classification
#                      Number of trees: 300
# No. of variables tried at each split: 2
#
#         OOB estimate of  error rate: 19.61%
# Confusion matrix:
#       Died Lived class.error
# Died   669    66  0.08979592
# Lived  165   278  0.37246050

a) What is the out-of-bag error for your model?

The OOB estimate of the error rate was 19.61% on the first run (the knitted output above shows 19.44%).

b) Based on the out-of-bag error, is the random forest better at predicting survival than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy?

Based on the out-of-bag errors, the two models perform with about the same accuracy; if anything, the random forest is slightly better, since its OOB error (19.61%) is a bit lower than the bagging model's (0.2037, i.e., 20.37%).

c) What is the false positive rate (for your random forest model developed with 300 trees)?

False positive rate = 0.08979592 (the class.error for Died in the first-run confusion matrix: 66 of the 735 passengers who died were predicted to have lived).

d) What is the false negative rate (for your random forest model developed with 300 trees)?

False negative rate = 0.37246050 (the class.error for Lived in the first-run confusion matrix: 165 of the 443 passengers who lived were predicted to have died).
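Both rates come straight from the model's OOB confusion matrix; a sketch using the first-run counts:

Titanic_rf$confusion        # rows are truth, columns are predictions, plus class.error
66/(669 + 66)               # FPR: died but predicted lived, ~0.0898
165/(165 + 278)             # FNR: lived but predicted died, ~0.3725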

e) Create a plot comparing the out-of-bag error, the false positive rate, the false negative rate, and the number of trees. Based on your plot, does it appear that a large number of trees are needed to develop a fairly accurate random forest model?

#No; fewer than 50 trees would be fine, as the error rates show little movement past roughly 35 trees
plot(Titanic_rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd = rep(2, 3), lty = c(1, 2, 3), col = c("black", "red", "green"))

f) Use your random forest model to make predictions for the observations in your testing set.

Titanic_rf_pred <- predict(Titanic_rf, newdata = Titanic_test, type= "class")
table(Titanic_test$Survived, Titanic_rf_pred, dnn = c("Truth", "Predicted"))
##        Predicted
## Truth   Died Lived
##   Died    71     7
##   Lived   20    33
# Results obtained from the first run:
#        Predicted
# Truth   Died Lived
#   Died    79    12
#   Lived   16    24