Directions:

In this assignment you will practice using classification methods on data from your project! This assignment should be submitted via R Markdown.

Step 1: Categorical Response

Now either use a different response variable that is categorical, or dichotomize a numeric variable so that it is transformed into a categorical one.

I will be using the categorical variable island. This variable is a factor with 3 levels and represents the island on which each penguin was observed:

y = {Biscoe, Dream, Torgersen}
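
A quick way to confirm these levels and how the observations are distributed across them is shown below; this is just a small sketch and assumes the palmerpenguins package, which provides the penguins data.

library(palmerpenguins)
levels(penguins$island)  # the three factor levels of the response
table(penguins$island)   # number of penguins observed on each island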

Step 2: Classification Models

Perform a classification analysis (using logistic regression or trees) on these data

I did not use method = "anova"; I used method = "class", and this was a deliberate choice. The "anova" method builds a regression tree and treats everything as numeric, which did not work well with my categorical response and was not helpful for my data set. I just wanted to note that I made this choice consciously.

# packages used in this analysis
library(palmerpenguins); library(dplyr); library(rpart); library(rpart.plot)

new_pen <- penguins %>%
  select(2:8)  # drop species; keep island and the remaining predictors

tree.penguins <- rpart(island ~ . - year, data = new_pen, method = "class")

printcp(tree.penguins) # display the results
## 
## Classification tree:
## rpart(formula = island ~ . - year, data = new_pen, method = "class")
## 
## Variables actually used in tree construction:
## [1] bill_depth_mm     bill_length_mm    flipper_length_mm
## 
## Root node error: 176/344 = 0.51163
## 
## n= 344 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.409091      0   1.00000 1.00000 0.052677
## 2 0.019886      1   0.59091 0.62500 0.049149
## 3 0.017045      3   0.55114 0.71591 0.050772
## 4 0.014205      5   0.51705 0.72159 0.050856
## 5 0.010000      7   0.48864 0.68750 0.050321
plotcp(tree.penguins) # visualize cross-validation results

#summary(tree.penguins)


rpart.plot(tree.penguins, type = 1,
   main="Classification Tree for Island")

In my decision tree, variable importance for my response variable appears to run in descending order: flipper length shows the highest importance, followed by bill depth, body mass, and then bill length. Flipper length has the most effect on the response variable island. It also looks like the largest flipper lengths interact with bill depth for the islands of Biscoe and Dream. I find this interesting because it suggests that the penguins on these islands have very large flippers but small bill depths. That opens up other interesting ways to approach the data: does it mean the food they eat is smaller, or needs to be broken apart in some way? I would also like to see whether body mass is correlated with these measurements, and whether penguins with larger flippers can swim better and catch quicker prey than penguins with smaller ones. I also see that Dream island has a lot of variance in all of the explanatory variables, which makes me wonder whether that is due to sex or to different species. It looks like decision trees can help identify which explanatory variables are important, and I could see using those variables to investigate this data set further.
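
As a check on the importance ordering described above, the fitted rpart object stores the importance scores directly; the sketch below simply sorts them, using the tree.penguins object from the chunk above.

# variable importance scores stored on the fitted rpart object,
# sorted from most to least important
sort(tree.penguins$variable.importance, decreasing = TRUE)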

Step 3: Assessing Model Fit and Error Rate

set.seed(1)

# randomly select half of the rows to use as the training set
train <- sample(1:nrow(penguins), nrow(penguins) / 2)

tree.penguins2 <- rpart(island ~ ., data = new_pen, subset = train, method = "class")

#summary(tree.penguins2)

printcp(tree.penguins2)
## 
## Classification tree:
## rpart(formula = island ~ ., data = new_pen, subset = train, method = "class")
## 
## Variables actually used in tree construction:
## [1] bill_length_mm    body_mass_g       flipper_length_mm
## 
## Root node error: 95/172 = 0.55233
## 
## n= 172 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.473684      0   1.00000 1.00000 0.068647
## 2 0.042105      1   0.52632 0.62105 0.065536
## 3 0.010000      6   0.31579 0.54737 0.063402
# Using cross-validation to select the complexity parameter
plotcp(tree.penguins2)

which.min(tree.penguins2$cptable[,"xerror"])
## 3 
## 3
tree.penguins2$cptable[which.min(tree.penguins2$cptable[,"xerror"]),"CP"]
## [1] 0.01
pfit <- prune(tree.penguins2, cp = 0.01)  # CP value chosen from the cptable above

rpart.plot(pfit, type = 2,
   main="Classification Tree for Island")