STA141A HW4

1.

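The following code builds the two-species data set and splits it into test and training data.

library(dplyr)

# Keep versicolor and virginica only; re-factor to drop the unused 'setosa' level
iris_sub <- iris %>% filter(Species != 'setosa') %>% mutate(Species = factor(Species))
# Test set: first 10 rows of each species
iris_test <- iris_sub %>%
  group_by(Species) %>% slice(1:10)
# Training set: remaining 40 rows of each species (each species has 50 rows)
iris_train <- iris_sub %>%
  group_by(Species) %>% slice(11:50)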

2.

The class-specific means of the predictor variables for the training data are as follows:
- Sepal.Length: 5.8950 for versicolor and 6.5925 for virginica.
- Sepal.Width: 2.7450 for versicolor and 2.9825 for virginica.
The same values are shown in the following table:

	Sepal.Length	Sepal.Width
versicolor	5.8950	2.7450
virginica	6.5925	2.9825
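The table above can be reproduced with a few lines of base R. This is a minimal sketch, not necessarily the code used for this report: it rebuilds the split with `ave()` instead of dplyr, and the name `row_in_class` is a hypothetical helper introduced here.

```r
# Rebuild the two-species data and the split in base R
# (assumed equivalent to the dplyr code in part 1)
iris_sub <- droplevels(subset(iris, Species != "setosa"))

# Position of each row within its species (1..50); hypothetical helper name
row_in_class <- ave(seq_len(nrow(iris_sub)), iris_sub$Species, FUN = seq_along)

iris_test  <- iris_sub[row_in_class <= 10, ]
iris_train <- iris_sub[row_in_class > 10, ]

# Mean of each predictor within each class of the training data
class_means <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                         data = iris_train, FUN = mean)
print(class_means)
```

`aggregate()` returns one row per species, so each row of the result corresponds to one row of the table.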

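The misclassification error rate on the test data is 0.4 (8 of 20 observations misclassified). In the confusion matrix below, rows are predicted classes and columns are true classes, so each column sums to the 10 test observations of that species: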

	versicolor	virginica
versicolor	4	2
virginica	6	8
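The error rate follows directly from the confusion matrix: it is the share of test observations that fall off the diagonal. A small sketch in R, assuming (as the column totals of 10 suggest) that columns hold the true class and rows the predicted class; the function name `misclass_rate` is illustrative.

```r
# Confusion matrix from the table above
# (assumption: rows are predicted classes, columns are true classes)
cm <- matrix(c(4, 6, 2, 8), nrow = 2,
             dimnames = list(predicted = c("versicolor", "virginica"),
                             actual    = c("versicolor", "virginica")))

# Misclassification rate = 1 - (correctly classified / total)
misclass_rate <- function(m) 1 - sum(diag(m)) / sum(m)

err <- misclass_rate(cm)
print(err)
```

Here the off-diagonal counts are 2 and 6, so the rate is (2 + 6) / 20 = 0.4.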

3.

(a) After fitting a logistic regression model to the training data with Sepal.Length and Sepal.Width as predictors, I obtained the following:
i. The parameter estimates are \(\hat\beta_0\) = -18.162 (intercept), \(\hat\beta_1\) = 2.5495 (Sepal.Length), and \(\hat\beta_2\) = 0.8173 (Sepal.Width). Their respective standard errors are 4.3571, 0.7097, and 1.0109.
ii. The misclassification error rate is 0.4, and the confusion matrix is shown here:

	versicolor	virginica
versicolor	4	2
virginica	6	8

iii. Sepal.Width does not appear to be a necessary predictor for the purpose of classification. Its coefficient is not statistically significant: the p-value is p = 0.4188, well above conventional significance levels such as 0.05.
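The p-value quoted in (iii) can be checked from the reported estimate and standard error alone. The sketch below assumes the usual Wald z-test that `summary()` reports for a `glm` fit; the variable names are illustrative.

```r
# Reported estimate and standard error of the Sepal.Width coefficient
beta2_hat <- 0.8173
se_beta2  <- 1.0109

# Wald statistic and two-sided p-value against a standard normal reference
z_stat  <- beta2_hat / se_beta2
p_value <- 2 * pnorm(-abs(z_stat))

print(round(c(z = z_stat, p = p_value), 4))
```

The statistic is about 0.81, far from the roughly 1.96 needed for significance at the 0.05 level.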

(b) After fitting a logistic regression model to the training data with Sepal.Length as the only predictor, I obtained the following:
i. The parameter estimates are \(\hat\beta_0\) = -17.1552 (intercept) and \(\hat\beta_1\) = 2.7647 (Sepal.Length). Their respective standard errors are 4.1251 and 0.6668.
ii. The misclassification error rate is 0.4, and the confusion matrix is shown here:

	versicolor	virginica
versicolor	4	2
virginica	6	8

iii. Compared with 3(a), the estimates and standard errors for the parameters shared by the two models are close (\(\hat\beta_0\): -18.162 vs. -17.1552; \(\hat\beta_1\): 2.5495 vs. 2.7647), and the misclassification error rate and confusion matrix are identical. So yes, my result in 3(b)(ii) supports my answer to 3(a)(iii): dropping Sepal.Width does not change performance on the test data, which is consistent with Sepal.Width not being a statistically significant predictor.
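The comparison between the two logistic models can be sketched as follows. This is an illustration under stated assumptions rather than the exact code behind the numbers above: the split is rebuilt in base R, a 0.5 probability cutoff is assumed for classification, and `fit_and_score` is a hypothetical helper.

```r
# Rebuild the split in base R (assumed equivalent to the dplyr code in part 1)
iris_sub <- droplevels(subset(iris, Species != "setosa"))
row_in_class <- ave(seq_len(nrow(iris_sub)), iris_sub$Species, FUN = seq_along)
iris_test  <- iris_sub[row_in_class <= 10, ]
iris_train <- iris_sub[row_in_class > 10, ]

# Fit a logistic regression, then score it on the test data
fit_and_score <- function(formula) {
  fit  <- glm(formula, family = binomial, data = iris_train)
  # With a factor response, glm models the probability of the second level (virginica)
  prob <- predict(fit, newdata = iris_test, type = "response")
  # Assumed decision rule: classify as virginica when the probability exceeds 0.5
  pred <- factor(ifelse(prob > 0.5, "virginica", "versicolor"),
                 levels = levels(iris_test$Species))
  list(coef      = summary(fit)$coefficients,
       confusion = table(predicted = pred, actual = iris_test$Species),
       error     = mean(pred != iris_test$Species))
}

full    <- fit_and_score(Species ~ Sepal.Length + Sepal.Width)  # model in (a)
reduced <- fit_and_score(Species ~ Sepal.Length)                # model in (b)

print(full$coef)
print(reduced$coef)
print(c(full = full$error, reduced = reduced$error))
```

Scoring both models through the same function keeps the cutoff and the test set identical, so any difference in error rate is due to the predictor set alone.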

4.

For k = 1, the misclassification error rate is 0.35, and the confusion matrix is shown below:

	versicolor	virginica
versicolor	5	2
virginica	5	8

For k = 5, the misclassification error rate is 0.4, and the confusion matrix is shown below:

	versicolor	virginica
versicolor	4	2
virginica	6	8
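A k-NN comparison of this kind can be sketched with `knn()` from the `class` package, a recommended package bundled with most R installations. The predictor set (Sepal.Length and Sepal.Width), the seed, and the helper name `knn_error` are assumptions made for illustration; the report does not state them.

```r
library(class)  # provides knn()

# Rebuild the split in base R (assumed equivalent to the dplyr code in part 1)
iris_sub <- droplevels(subset(iris, Species != "setosa"))
row_in_class <- ave(seq_len(nrow(iris_sub)), iris_sub$Species, FUN = seq_along)

predictors <- c("Sepal.Length", "Sepal.Width")  # assumed predictor set
train_x <- iris_sub[row_in_class > 10, predictors]
test_x  <- iris_sub[row_in_class <= 10, predictors]
train_y <- iris_sub$Species[row_in_class > 10]
test_y  <- iris_sub$Species[row_in_class <= 10]

set.seed(1)  # knn() breaks ties at random, so fix the seed for repeatability

# Print the confusion matrix for a given k and return the test error rate
knn_error <- function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  print(table(predicted = pred, actual = test_y))
  mean(pred != test_y)
}

errors <- sapply(c(k1 = 1, k5 = 5), knn_error)
print(errors)
```

Because ties are broken at random, results for a given k can vary slightly with the seed.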

5.

All three classification methods produce the same confusion matrix and the same misclassification error rate of 0.4. The only exception is k-NN with k = 1, whose error rate of 0.35 is still close. The methods therefore performed about equally well on this test set. Overall, though, none of them performed especially well: each classified only 60% to 65% of the 20 test observations correctly and misclassified the remaining 35% to 40%.

Shilah Johnson

11/23/2020
