Directions:
In this assignment you will practice using classification methods on data from your project! This assignment should be submitted via R Markdown.
Step 1: Categorical response
Now either use a different response variable that is categorical or dichotomize a numeric variable so that it is transformed.
I will be using the categorical variable island. It is a factor with 3 levels, which I code as:
$$
y = \begin{cases}
1 & \text{if Biscoe} \\
2 & \text{if Dream} \\
3 & \text{if Torgersen}
\end{cases}
$$
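The coding above can be checked against how R stores a factor. This is a minimal sketch on a hand-typed vector of island names (an illustrative stand-in, not the project data). R orders factor levels alphabetically by default, which matches the 1/2/3 coding.

```r
# Hand-typed example values; not taken from the penguins data
island <- factor(c("Biscoe", "Dream", "Torgersen", "Dream"))

levels(island)               # level order R assigns by default (alphabetical)
codes <- as.integer(island)  # underlying integer code for each observation
codes
```

Running `levels(new_pen$island)` on the project data would be the equivalent check there.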
Step 2: Classification Models
Perform a classification analysis (using logistic regression or trees) on these data
I used method = "class" in rpart rather than method = "anova", and this was a conscious choice. The anova method did not work well with my categorical variables: it treated everything as numeric, which was not helpful for my data set.
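library(palmerpenguins) # assumed source of the penguins data
library(dplyr)
library(rpart)
library(rpart.plot)
new_pen <- penguins %>%
  select(2:8) # keep columns 2 through 8 (island through year)
tree.penguins <- rpart(island ~ . - year, data = new_pen, method = "class")
printcp(tree.penguins) # display the results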
##
## Classification tree:
## rpart(formula = island ~ . - year, data = new_pen, method = "class")
##
## Variables actually used in tree construction:
## [1] bill_depth_mm bill_length_mm flipper_length_mm
##
## Root node error: 176/344 = 0.51163
##
## n= 344
##
##         CP nsplit rel error  xerror     xstd
## 1 0.409091      0   1.00000 1.00000 0.052677
## 2 0.019886      1   0.59091 0.62500 0.049149
## 3 0.017045      3   0.55114 0.71591 0.050772
## 4 0.014205      5   0.51705 0.72159 0.050856
## 5 0.010000      7   0.48864 0.68750 0.050321
plotcp(tree.penguins) # visualize cross-validation results
#summary(tree.penguins)
rpart.plot(tree.penguins, type = 1,
main="Classification Tree for Island")
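The rel error and xerror columns printed by printcp are scaled so that the root node equals 1, so multiplying them by the root node error converts them to misclassification rates. Here is a small sketch using values copied from the output above. The column names `train_error` and `cv_error` are my own labels, not rpart names.

```r
# Values copied from the printcp output for tree.penguins
root_error <- 176 / 344

cp_table <- data.frame(
  CP        = c(0.409091, 0.019886, 0.017045, 0.014205, 0.010000),
  nsplit    = c(0, 1, 3, 5, 7),
  rel_error = c(1.00000, 0.59091, 0.55114, 0.51705, 0.48864),
  xerror    = c(1.00000, 0.62500, 0.71591, 0.72159, 0.68750)
)

# Rescale from "relative to the root node" to actual misclassification rates
cp_table$train_error <- cp_table$rel_error * root_error
cp_table$cv_error    <- cp_table$xerror * root_error

# Row with the lowest cross-validated error
best_row <- which.min(cp_table$xerror)
cp_table[best_row, ]
```

With these numbers, the lowest cross-validated error is at one split (about 0.32), while the seven-split tree has the lowest training error (about 0.25).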
Fit your chosen model. Please include the code output
Describe your model in the context of these data. What do the coefficients (or branches) represent and how do they affect change on the response?
In my decision tree, the explanatory variables rank in descending order of importance for my response variable, island: flipper length shows the highest importance, followed by bill depth, body mass, and then bill length. Flipper length therefore has the largest effect on the response. It looks like, for the largest flipper lengths, bill depth also matters for the islands of Biscoe and Dream. I find this interesting because it suggests that penguins on these islands have very large flippers but small bill depths.

This raises several questions. Does a smaller bill depth mean that the food these penguins eat is smaller, or that it needs to be broken apart in some way? Is body mass correlated with these measurements as well? Since these penguins have larger flippers, can they swim better and catch quicker prey than penguins with smaller flippers?

I also see that Dream island has a lot of variation in all of the explanatory variables. I would want to check whether that is due to sex or to different species. Overall, decision trees can help identify which explanatory variables are important, and I could then use those variables to investigate this data set further.
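The importance ranking can be read from the fitted object, not only from the plot. The sketch below uses a hypothetical helper, `importance_summary`, and the built-in iris data purely so that it runs without extra packages. rpart's importance scores also credit surrogate splits, which may be why body mass ranks in the importance order even though printcp does not list it among the variables actually used.

```r
library(rpart)

# Hypothetical helper: compare importance scores with the variables used in splits
importance_summary <- function(fit) {
  imp <- fit$variable.importance             # named vector; includes surrogate credit
  split_vars <- as.character(fit$frame$var)  # one entry per node; "<leaf>" marks leaves
  list(
    importance_pct = round(100 * imp / sum(imp), 1),
    used_in_splits = unique(split_vars[split_vars != "<leaf>"])
  )
}

# iris is a stand-in data set, not the penguins data
demo_fit <- rpart(Species ~ ., data = iris, method = "class")
imp_res <- importance_summary(demo_fit)
imp_res$importance_pct  # share of total importance per variable
imp_res$used_in_splits  # variables that appear as primary splits
```

For the tree above, the equivalent call would be `importance_summary(tree.penguins)`.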
Step 3: Assessing Model Fit and Error Rate
Randomly split your data into a test and training set
Fit the model on the training set
Create a confusion matrix to assess the error rate of your model(s) on the testing set
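set.seed(1) # make the random split reproducible
# Randomly assign half of the rows to the training set
train <- sample(seq_len(nrow(new_pen)), nrow(new_pen) / 2)
tree.penguins2 <- rpart(island ~ ., data = new_pen, subset = train, method = "class")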
#summary(tree.penguins2)
printcp(tree.penguins2)
##
## Classification tree:
## rpart(formula = island ~ ., data = new_pen, subset = train, method = "class")
##
## Variables actually used in tree construction:
## [1] bill_length_mm body_mass_g flipper_length_mm
##
## Root node error: 95/172 = 0.55233
##
## n= 172
##
##         CP nsplit rel error  xerror     xstd
## 1 0.473684      0   1.00000 1.00000 0.068647
## 2 0.042105      1   0.52632 0.62105 0.065536
## 3 0.010000      6   0.31579 0.54737 0.063402
#Using cross-validation to select complexity
plotcp(tree.penguins2)
which.min(tree.penguins2$cptable[,"xerror"])
## 3
## 3
tree.penguins2$cptable[which.min(tree.penguins2$cptable[,"xerror"]),"CP"]
## [1] 0.01
pfit<- prune(tree.penguins2, cp=0.01) # from cptable
rpart.plot(pfit, type = 2,
main="Classification Tree for Island")
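The confusion matrix on the testing set can be built by predicting classes for the held-out rows and cross-tabulating them against the true labels. This is a minimal sketch with a hypothetical helper, `test_confusion`, demonstrated on the built-in iris data so that it runs on its own. It assumes the response has no missing values.

```r
library(rpart)

# Hypothetical helper: confusion matrix and error rate on the rows not used for training
test_confusion <- function(fit, data, train_idx, response) {
  test <- data[-train_idx, ]                            # held-out rows
  pred <- predict(fit, newdata = test, type = "class")  # predicted class labels
  cm <- table(predicted = pred, actual = test[[response]])
  list(confusion = cm,
       error_rate = 1 - sum(diag(cm)) / sum(cm))        # share misclassified
}

# iris is a stand-in data set, not the penguins data
set.seed(1)
train_idx <- sample(seq_len(nrow(iris)), nrow(iris) / 2)
demo_fit <- rpart(Species ~ ., data = iris, subset = train_idx, method = "class")

cm_res <- test_confusion(demo_fit, iris, train_idx, "Species")
cm_res$confusion   # rows: predicted class, columns: actual class
cm_res$error_rate  # test-set misclassification rate
```

With the objects defined above, the corresponding call would be `test_confusion(pfit, new_pen, train, "island")`.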