rm(list = ls()) # Clear all objects from the environment
gc() # Trigger garbage collection and report memory usage
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 521109 27.9 1158690 61.9 660385 35.3
## Vcells 947300 7.3 8388608 64.0 1769723 13.6
# Load the iris dataset
data(iris)
# Split the data into train (70%) and test (30%) sets
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]
# Fit a logistic regression model
# (note: Species has three levels, and family = binomial models only the
#  first level, "setosa", against the other two)
logreg_model <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = train_data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Print the coefficient estimates
summary(logreg_model)
##
## Call:
## glm(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length +
## Petal.Width, family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.220 540630.011 0 1
## Sepal.Length -12.295 152119.425 0 1
## Sepal.Width -7.802 70088.697 0 1
## Petal.Length 20.558 124693.199 0 1
## Petal.Width 21.938 178623.800 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.3501e+02 on 104 degrees of freedom
## Residual deviance: 2.3265e-09 on 100 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
These results do not mean there is no relationship between the predictors (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and the response (Species). The enormous standard errors, z-values of 0, p-values of 1, and a residual deviance of essentially zero are the classic signature of complete separation: the predictors separate setosa perfectly from the other two species, so the maximum likelihood estimates diverge and glm() stops only because it hits its iteration limit. There is a second issue: Species has three levels, but family = binomial fits a two-class model of the first level ("setosa") against the other two, so versicolor and virginica are never distinguished. A multinomial model is the right tool for a three-class response; a sketch is given after the evaluation below.
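As a quick check that separation really is the problem, a single predictor already splits setosa cleanly from the rest (a minimal sketch; the 2.5 cm cutoff is an illustrative threshold, not something produced by the model):

# Every setosa in iris has Petal.Length below 2.5 cm, and every
# versicolor and virginica is above it -- complete separation
table(iris$Species, iris$Petal.Length < 2.5)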
# Predict on test set
test_data$predicted_species <- predict(logreg_model, newdata = test_data, type = "response")
# Convert predicted probabilities to predicted species
# (note: type = "response" yields probabilities in [0, 1], so the second
#  condition is always true when reached and "virginica" is never
#  predicted -- every non-setosa case is labelled "versicolor")
test_data$predicted_species <- ifelse(test_data$predicted_species < 0.5, "setosa", ifelse(test_data$predicted_species < 1.5, "versicolor", "virginica"))
# Evaluate the model
accuracy <- sum(test_data$predicted_species == test_data$Species) / nrow(test_data)
cat("Accuracy:", accuracy)
## Accuracy: 0.7111111
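Because Species has three classes, a multinomial model fits this problem better than a binomial glm. Below is a minimal sketch using multinom() from the nnet package on the same train/test split (the object names are illustrative):

library(nnet)
# Fit a three-class multinomial logistic regression on the same predictors
multinom_model <- multinom(Species ~ Sepal.Length + Sepal.Width +
                             Petal.Length + Petal.Width,
                           data = train_data)
# predict() returns class labels by default; compare them to the truth
multinom_pred <- predict(multinom_model, newdata = test_data)
table(Predicted = multinom_pred, Actual = test_data$Species)
mean(multinom_pred == test_data$Species) # overall accuracy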
Multiple regression (often loosely called multivariate regression, though strictly that term refers to models with several response variables) is a statistical technique used to analyze the relationship between multiple independent variables and a dependent variable. It is typically employed when you have several predictors and want to understand how they collectively influence the outcome.
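As a minimal sketch on the same data set (the choice of Sepal.Length as the response is purely illustrative), a multiple linear regression looks like this:

# Model sepal length as a linear function of the other three measurements
mlr_model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                data = iris)
summary(mlr_model) # coefficient estimates, R-squared, and overall F-test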
library(ggplot2)
# Create a scatter plot of the iris dataset
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(x = "Sepal Length", y = "Sepal Width", title = "Iris Sepal Measurements") +
  theme_minimal()
# Overlay the predicted species from the logistic regression model:
# true species as semi-transparent dots, test-set predictions as open triangles
ggplot() +
  geom_point(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species), alpha = 0.5) +
  geom_point(data = test_data, aes(x = Sepal.Length, y = Sepal.Width, color = predicted_species), shape = 2) +
  labs(x = "Sepal Length", y = "Sepal Width", title = "Iris Sepal Measurements with Predicted Species") +
  theme_minimal()
# Exponentiate the coefficients to express them as odds ratios
# (because of the complete separation flagged above, these values are not
#  interpretable)
exp(coef(logreg_model))
## (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
## 8.183325e+07 4.575371e-06 4.087898e-04 8.480463e+08 3.369939e+09
Reflecting on the past 14 weeks of coursework in this data analysis course, I have learned a great deal about R programming. I had originally planned to finish the course in fall 2023, but I could not; now, however, I have a strong foundation in data analysis. Professor Arvind Sharma guided me very well in both semesters, which is why I am confident in data analytics. We covered different types of variables, their measurement scales, and a wide range of statistical plots for any data set, all of which gave me new experience in R programming. I am usually not very interested in theory, but in this course I genuinely enjoyed the theory behind the calculations and several other topics. I would like to continue learning data analysis, and this course is a solid foundation for anyone looking to build their knowledge in data science.