Blog 4 Data 621

Blog Entry 4: Logistic regression with Iris dataset

Overview

In this blog post, we applied logistic regression to the well-known iris dataset. Logistic regression is a statistical method for binary classification: it models the probability that an observation belongs to one of two classes. We began by loading the iris dataset, which contains four measurements (sepal length, sepal width, petal length, and petal width) for 150 iris flowers along with their species. Because the dataset has three species, we recoded the target variable into two classes, "setosa" and "other". After visualizing the data, we split it into training (80%) and testing (20%) sets and fit a logistic regression model on the training data. We then evaluated the model’s performance on the testing set by computing the confusion matrix. Finally, we noted some real-life applications of logistic regression.
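To make "predicting the probability of belonging to a class" concrete, here is a minimal sketch of the logistic (inverse-logit) function, which turns a linear predictor into a probability. The coefficients `b0` and `b1` and the inputs `x` are made-up illustration values, not estimates from the iris model.

```r
# Logistic (inverse-logit) function: maps any real number into (0, 1)
logistic <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical coefficients and inputs, chosen only for illustration
b0 <- -3
b1 <- 1.5
x  <- c(0, 2, 4)

eta  <- b0 + b1 * x    # linear predictor (the log-odds)
prob <- logistic(eta)  # corresponding probabilities

# Base R's plogis() computes the same function
same_as_plogis <- isTRUE(all.equal(prob, plogis(eta)))
print(round(prob, 3))
```

A log-odds of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the usual cutoff when turning probabilities into class labels.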

#Loading libraries
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
#install.packages("caret")
library(caret)

## Loading required package: lattice

#Loading iris dataset
data(iris)

#Exploring the structure of the dataset
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

#Visualization of the dataset
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Sepal Length vs Sepal Width by Species")

#Data preprocessing

#Let's encode the target variable (Species) into two classes: "setosa" vs "other"
iris$Species <- as.factor(ifelse(iris$Species == "setosa", "setosa", "other"))

#Splitting the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = .8, 
                                  list = FALSE, 
                                  times = 1)
iris_train <- iris[ trainIndex,]
iris_test  <- iris[-trainIndex,]
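As a rough sanity check on the split sizes: `createDataPartition` samples within each class, so an 80% split of 100 "other" and 50 "setosa" flowers should leave about 20 and 10 flowers in the test set. The sketch below only does that arithmetic in base R and does not call caret; the names `iris_raw`, `binary_species`, `class_counts`, `train_counts`, and `test_counts` are illustrative.

```r
# Work on a fresh copy so the recoded `iris` in the session is left untouched
iris_raw <- datasets::iris

# Same two-class recoding as above, applied to the copy
binary_species <- ifelse(iris_raw$Species == "setosa", "setosa", "other")

# c() turns the table into a plain named integer vector
class_counts <- c(table(binary_species))

# Expected sizes for an 80/20 split within each class
train_counts <- round(class_counts * 0.8)
test_counts  <- class_counts - train_counts

print(rbind(total = class_counts, train = train_counts, test = test_counts))
```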

#Training the logistic regression model
log_model <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
                  data = iris_train, family = "binomial")

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
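These two warnings are typical of complete separation, where a predictor splits the two classes perfectly and the maximum-likelihood estimates drift toward infinity. A minimal base-R check, assuming the unmodified data from `datasets::iris` (the variable names are illustrative), compares the range of petal length in each group:

```r
# Use a fresh copy of the data; the session's `iris` has a recoded Species
iris_raw  <- datasets::iris
is_setosa <- iris_raw$Species == "setosa"

# Range of petal length within each of the two classes
range_setosa <- range(iris_raw$Petal.Length[is_setosa])
range_other  <- range(iris_raw$Petal.Length[!is_setosa])

# TRUE when every setosa petal is shorter than every non-setosa petal
separated <- max(range_setosa) < min(range_other)

print(range_setosa)
print(range_other)
print(separated)
```

If the two ranges do not overlap, a threshold on that single variable already classifies every flower, which would account for both warnings and for very large standard errors in the fitted model.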

#Model summary
summary(log_model)

## 
## Call:
## glm(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + 
##     Petal.Width, family = "binomial", data = iris_train)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -14.762 538087.652       0        1
## Sepal.Length     11.391 150807.997       0        1
## Sepal.Width       7.663  62644.225       0        1
## Petal.Length    -19.949 121563.142       0        1
## Petal.Width     -22.004 167910.701       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.5276e+02  on 119  degrees of freedom
## Residual deviance: 2.6529e-09  on 115  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25

#Making predictions on the testing set
predictions <- predict(log_model, iris_test, type = "response")

#Converting predicted probabilities to class labels
predicted_classes <- ifelse(predictions > 0.5, "setosa", "other")
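The `> 0.5` rule maps to "setosa" because, for a binomial `glm` with a factor response, R treats the first factor level as failure and the other level as success, so `predict(..., type = "response")` returns the probability of the second level. A small sketch of that level ordering, using a toy vector and made-up probabilities rather than the iris data:

```r
# Toy response; as.factor() sorts the levels alphabetically
toy_species <- as.factor(c("setosa", "other", "other", "setosa"))
toy_levels  <- levels(toy_species)

failure_level <- toy_levels[1]  # modeled as 0 by a binomial glm
success_level <- toy_levels[2]  # modeled as 1 by a binomial glm

# Hypothetical predicted probabilities of the success level
toy_probs <- c(0.9, 0.2, 0.4, 0.7)

# Same cutoff rule as above, written in terms of the factor levels
toy_classes <- ifelse(toy_probs > 0.5, success_level, failure_level)
print(toy_levels)
print(toy_classes)
```

Deriving the labels from `levels()` instead of typing them by hand is one way to avoid accidentally swapping the classes if the factor is ever recoded.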

#Evaluating model performance
confusionMatrix(table(predicted_classes, iris_test$Species))

## Confusion Matrix and Statistics
## 
##                  
## predicted_classes other setosa
##            other     20      0
##            setosa     0     10
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8843, 1)
##     No Information Rate : 0.6667     
##     P-Value [Acc > NIR] : 5.215e-06  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6667     
##          Detection Rate : 0.6667     
##    Detection Prevalence : 0.6667     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : other      
##
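The headline statistics can be reproduced by hand from the four cell counts. The sketch below rebuilds the table shown above as a plain matrix and, following the output, treats "other" as the positive class; the object names are illustrative.

```r
# Confusion matrix from the output above: rows = predicted, columns = actual
cm <- matrix(c(20, 0, 0, 10), nrow = 2,
             dimnames = list(predicted = c("other", "setosa"),
                             actual    = c("other", "setosa")))

# Share of all test flowers that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# With "other" as the positive class:
sensitivity <- cm["other", "other"] / sum(cm[, "other"])     # true positive rate
specificity <- cm["setosa", "setosa"] / sum(cm[, "setosa"])  # true negative rate
prevalence  <- sum(cm[, "other"]) / sum(cm)                  # share of actual positives

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, prevalence = prevalence), 4))
```

The prevalence of 20/30 is also the No Information Rate in the output: always guessing "other" would already be right two times out of three, so accuracy should be judged against that baseline.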

#Real-life applications
#Logistic regression can be used in various real-life scenarios such as predicting customer churn, detecting fraudulent transactions, or classifying email spam.

Conclusion

Logistic regression is a valuable tool in statistics and machine learning, particularly for binary classification tasks. In this blog post, we used the iris dataset to predict whether a flower is setosa or another species from its sepal and petal measurements. The model classified all 30 test flowers correctly (accuracy of 1, 95% CI 0.8843 to 1). That result should be read with care: glm warned that the algorithm did not converge and that fitted probabilities were numerically 0 or 1, and every coefficient has a very large standard error and a p-value of 1. This pattern usually indicates that the two classes are perfectly separable in the training data, so the class predictions are accurate here but the individual coefficients should not be interpreted. Beyond this example, logistic regression is used for practical classification problems such as predicting customer churn, detecting fraudulent transactions, and classifying spam email, which makes it a useful starting point for statistical analysis and predictive modeling.


Laura B

2024-05-13
