rm(list = ls())      # Clear all objects from the environment
gc()                 # Run garbage collection and report memory usage
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 521109 27.9    1158690 61.9   660385 35.3
## Vcells 947300  7.3    8388608 64.0  1769723 13.6

A. Implement logistic regression on a dataset of your choice, and interpret your coefficients.

# Load the iris dataset
data(iris)
# Split the data into 70% train and 30% test sets
set.seed(123)  # for reproducibility
train_indices <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]
# Fit a logistic regression model; with a three-level factor response,
# family = binomial silently contrasts the first level (setosa) against
# the other two rather than fitting a true three-class model
logreg_model <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = train_data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Print the coefficient estimates
summary(logreg_model)
## 
## Call:
## glm(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + 
##     Petal.Width, family = binomial, data = train_data)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)      18.220 540630.011       0        1
## Sepal.Length    -12.295 152119.425       0        1
## Sepal.Width      -7.802  70088.697       0        1
## Petal.Length     20.558 124693.199       0        1
## Petal.Width      21.938 178623.800       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.3501e+02  on 104  degrees of freedom
## Residual deviance: 2.3265e-09  on 100  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25

The output says more than "no statistically significant relationships." The residual deviance is essentially zero, the coefficient estimates are large, and the standard errors are astronomically larger still; together with the convergence warnings, this is the classic signature of complete separation: the predictors (chiefly the petal measurements) separate the classes perfectly, so the maximum likelihood estimates diverge, and the p-values of 1 reflect inflated standard errors rather than an absence of relationships. There is also a specification problem: Species has three levels, so family = binomial models only setosa versus the rest, and the coefficients cannot be read as a three-class classifier.
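
One way to obtain a stable binary fit (a minimal sketch, under the assumption that a two-class contrast answers the question) is to restrict the data to versicolor and virginica, which overlap and so are not perfectly separable; binary_data and binary_model are illustrative names:

# Keep the two overlapping classes so complete separation does not occur;
# droplevels() removes the now-unused "setosa" level from the factor
binary_data <- droplevels(subset(train_data, Species != "setosa"))

# Binary logistic regression: models P(Species == "virginica"),
# the second remaining factor level
binary_model <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = binary_data, family = binomial)
summary(binary_model)

With only about 70 training rows this contrast can still be near-separable, so similar warnings may occasionally appear, but it avoids the guaranteed separation that setosa causes.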

# Predict on the test set; type = "response" returns P(Species != "setosa"),
# a probability between 0 and 1
test_data$predicted_species <- predict(logreg_model, newdata = test_data, type = "response")

# Convert predicted probabilities to species labels; because the probabilities
# never exceed 1, the second threshold is unreachable and "virginica" is never
# predicted, which caps the attainable accuracy
test_data$predicted_species <- ifelse(test_data$predicted_species < 0.5, "setosa",
                                      ifelse(test_data$predicted_species < 1.5, "versicolor", "virginica"))

# Evaluate the model
accuracy <- sum(test_data$predicted_species == test_data$Species) / nrow(test_data)
cat("Accuracy:", accuracy)
## Accuracy: 0.7111111
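
Accuracy alone hides where the misclassifications fall. As a quick check (a minimal sketch reusing the test_data columns created above), a cross-tabulation with base R's table() shows which species are being confused:

# Cross-tabulate predicted vs. actual species to locate the errors
table(Predicted = test_data$predicted_species, Actual = test_data$Species)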

Tell us why you should not run a multivariate regression.

Multivariate regression models one or more continuous dependent variables as linear functions of the predictors. It should not be run here because the response, Species, is a categorical variable with three unordered levels: coding it as a number would impose an arbitrary ordering and spacing on the classes, the normality assumption on the residuals could not hold, and predictions would fall outside the set of valid categories. A classification method, such as multinomial logistic regression, is the appropriate tool; see the sketch below.
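
For completeness, here is a minimal sketch of the appropriate multiclass model, assuming the nnet package (shipped as a recommended package with R) is available; multi_model and multi_pred are illustrative names:

library(nnet)

# Multinomial logistic regression handles all three species at once
multi_model <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                        data = train_data)

# Predicted class labels on the held-out test set
multi_pred <- predict(multi_model, newdata = test_data)
mean(multi_pred == test_data$Species)   # test-set accuracy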

library(ggplot2)

# Create a scatter plot of the iris dataset
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(x = "Sepal Length", y = "Sepal Width", title = "Iris Sepal Measurements") +
  theme_minimal()

# Overlay the test-set predictions (triangles) on the full dataset; the
# predicted labels come from the two-class rule above, so "virginica" never appears
ggplot() +
  geom_point(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species), alpha = 0.5) +
  geom_point(data = test_data, aes(x = Sepal.Length, y = Sepal.Width, color = predicted_species), shape = 2) +
  labs(x = "Sepal Length", y = "Sepal Width", title = "Iris Sepal Measurements with Predicted Species") +
  theme_minimal()

# Exponentiate the coefficients to get odds ratios; under complete separation
# these magnitudes (1e+07 to 1e+09) are numerically meaningless
exp(coef(logreg_model))
##  (Intercept) Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
## 8.183325e+07 4.575371e-06 4.087898e-04 8.480463e+08 3.369939e+09
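
On a stable fit, exponentiated coefficients do carry the usual interpretation: each value is the multiplicative change in the odds of the modeled class per one-unit increase in that predictor. A sketch using the binary_model from the versicolor-versus-virginica example above:

# Odds ratios (point estimates and profile-likelihood CIs) from the
# well-behaved binary fit sketched earlier
exp(coef(binary_model))
exp(confint(binary_model))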

II. REFLECTION

Please reflect on the last 14 weeks. Maybe even skim back over the material to consolidate the topics we covered in class. What have you learned about data analysis, both theoretically and empirically?

Reflecting on the past 14 weeks of coursework in data analysis, I have learned a great deal about R programming. I was originally supposed to finish the course in fall 2023 but could not; now, however, I have a solid grounding in data analysis. Professor Arvind Sharma guided me very well in both semesters, which is why I am confident in data analytics. We studied different types of variables and their measurement scales, and produced a wide range of statistical plots for many datasets, all of which gave me new experience in R programming. I am not usually drawn to theory, but in this course I genuinely enjoyed the theory behind the calculations and several other topics. I would like to continue learning data analysis, and this course is a solid foundation for anyone looking to build their knowledge in data science.