Directions:

In this assignment you will practice using classification methods on data from your project! This assignment should be submitted via R Markdown.

Step 1: Categorical Response

Now either use a different response variable that is categorical, or dichotomize a numeric variable so that it is transformed into a categorical one.

I will be using the categorical variable island. This variable is a factor with 3 levels and represents the island on which each penguin was observed:

y = {Biscoe, Dream, Torgersen}
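
A quick way to confirm these levels and how the observations are distributed across them is shown below; this is just a small sketch and assumes the palmerpenguins package, which provides the penguins data.

library(palmerpenguins)
levels(penguins$island)  # the three factor levels of the response
table(penguins$island)   # number of penguins observed on each island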

Step 2: Classification Models

Perform a classification analysis (using logistic regression or trees) on these data

I did not use method = "anova"; I used method = "class", and this was a deliberate choice. The "anova" method builds a regression tree and treats everything as numeric, which did not work well with my categorical response and was not helpful for my data set. I just wanted to note that I made this choice consciously.

# packages used in this analysis
library(palmerpenguins); library(dplyr); library(rpart); library(rpart.plot)

new_pen <- penguins %>%
  select(2:8)  # drop species; keep island and the remaining predictors

tree.penguins <- rpart(island ~ . - year, data = new_pen, method = "class")

printcp(tree.penguins) # display the results
## 
## Classification tree:
## rpart(formula = island ~ . - year, data = new_pen, method = "class")
## 
## Variables actually used in tree construction:
## [1] bill_depth_mm     bill_length_mm    flipper_length_mm
## 
## Root node error: 176/344 = 0.51163
## 
## n= 344 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.409091      0   1.00000 1.00000 0.052677
## 2 0.019886      1   0.59091 0.62500 0.049149
## 3 0.017045      3   0.55114 0.71591 0.050772
## 4 0.014205      5   0.51705 0.72159 0.050856
## 5 0.010000      7   0.48864 0.68750 0.050321
plotcp(tree.penguins) # visualize cross-validation results

#summary(tree.penguins)


rpart.plot(tree.penguins, type = 1,
   main="Classification Tree for Island")

In my decision tree, variable importance for my response variable appears to run in descending order: flipper length shows the highest importance, followed by bill depth, body mass, and then bill length. Flipper length has the most effect on the response variable island. It also looks like the largest flipper lengths interact with bill depth for the islands of Biscoe and Dream. I find this interesting because it suggests that the penguins on these islands have very large flippers but small bill depths. That opens up other interesting ways to approach the data: does it mean the food they eat is smaller, or needs to be broken apart in some way? I would also like to see whether body mass is correlated with these measurements, and whether penguins with larger flippers can swim better and catch quicker prey than penguins with smaller ones. I also see that Dream island has a lot of variance in all of the explanatory variables, which makes me wonder whether that is due to sex or to different species. It looks like decision trees can help identify which explanatory variables are important, and I could see using those variables to investigate this data set further.
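
As a check on the importance ordering described above, the fitted rpart object stores the importance scores directly; the sketch below simply sorts them, using the tree.penguins object from the chunk above.

# variable importance scores stored on the fitted rpart object,
# sorted from most to least important
sort(tree.penguins$variable.importance, decreasing = TRUE)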

Step 3: Assessing Model Fit and Error Rate

set.seed(1)

# randomly select half of the rows to use as the training set
train <- sample(1:nrow(penguins), nrow(penguins) / 2)

tree.penguins2 <- rpart(island ~ ., data = new_pen, subset = train, method = "class")

#summary(tree.penguins2)

printcp(tree.penguins2)
## 
## Classification tree:
## rpart(formula = island ~ ., data = new_pen, subset = train, method = "class")
## 
## Variables actually used in tree construction:
## [1] bill_length_mm    body_mass_g       flipper_length_mm
## 
## Root node error: 95/172 = 0.55233
## 
## n= 172 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.473684      0   1.00000 1.00000 0.068647
## 2 0.042105      1   0.52632 0.62105 0.065536
## 3 0.010000      6   0.31579 0.54737 0.063402
# Using cross-validation to select the complexity parameter
plotcp(tree.penguins2)

which.min(tree.penguins2$cptable[,"xerror"])
## 3 
## 3
tree.penguins2$cptable[which.min(tree.penguins2$cptable[,"xerror"]),"CP"]
## [1] 0.01
pfit <- prune(tree.penguins2, cp = 0.01)  # CP value chosen from the cptable above

rpart.plot(pfit, type = 2,
   main="Classification Tree for Island")