# Work on a copy of the built-in mtcars dataset
data <- mtcars
# Creating a binary outcome variable (1 for high mpg, 0 otherwise)
data$high_efficiency <- ifelse(data$mpg > median(data$mpg), 1, 0)
# Fit logistic regression model
model <- glm(high_efficiency ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
             family = binomial(link = "logit"),
             data = data, 
             control = glm.control(maxit = 100)
             )
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Display model summary
summary(model)
## 
## Call:
## glm(formula = high_efficiency ~ cyl + disp + hp + drat + wt + 
##     qsec + vs + am + gear + carb, family = binomial(link = "logit"), 
##     data = data, control = glm.control(maxit = 100))
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  3.508e+02  4.241e+06       0        1
## cyl         -4.260e+01  2.882e+05       0        1
## disp         1.136e+00  3.572e+03       0        1
## hp          -8.687e-01  4.059e+03       0        1
## drat        -1.933e+01  2.610e+05       0        1
## wt          -1.087e+02  4.177e+05       0        1
## qsec         6.854e+00  1.653e+05       0        1
## vs           2.711e+01  3.244e+05       0        1
## am           2.761e+01  5.625e+05       0        1
## gear        -2.116e+01  4.110e+05       0        1
## carb         4.340e+01  2.214e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.4236e+01  on 31  degrees of freedom
## Residual deviance: 2.7158e-10  on 21  degrees of freedom
## AIC: 22
## 
## Number of Fisher Scoring iterations: 26

A note on the warning above: the glm.fit message, together with the astronomically large standard errors and p-values of 1, indicates (quasi-)complete separation. With ten predictors and only 32 observations, the model separates the two classes perfectly, so the maximum-likelihood coefficient estimates diverge, and the magnitudes below should be read qualitatively (sign and direction) rather than literally.

Interpretation of coefficients:

cyl (Number of cylinders):

For a one-unit increase in the number of cylinders, the log-odds of the event (high_efficiency = 1) is expected to decrease by approximately 42.60, holding other variables constant.

disp (Displacement):

For a one-unit increase in displacement, the log-odds of the event is expected to increase by approximately 1.14, holding other variables constant.

hp (Gross horsepower):

For a one-unit increase in horsepower, the log-odds of the event is expected to decrease by approximately 0.87, holding other variables constant.

drat (Rear axle ratio):

For a one-unit increase in the rear axle ratio, the log-odds of the event is expected to decrease by approximately 19.33, holding other variables constant.

wt (Weight):

For a one-unit increase in weight, the log-odds of the event is expected to decrease by approximately 108.70, holding other variables constant.

qsec (1/4 mile time):

For a one-unit increase in the 1/4 mile time, the log-odds of the event is expected to increase by approximately 6.85, holding other variables constant.

vs (Engine type: 0 = V-shaped, 1 = straight):

A straight engine (vs = 1) is associated with an increase of approximately 27.11 in the log-odds of the event relative to a V-shaped engine, holding other variables constant.

am (Transmission type: 0 = automatic, 1 = manual):

A manual transmission (am = 1) is associated with an increase of approximately 27.61 in the log-odds of the event relative to an automatic transmission, holding other variables constant.

gear (Number of forward gears):

For a one-unit increase in the number of gears, the log-odds of the event is expected to decrease by approximately 21.16, holding other variables constant.

carb (Number of carburetors):

For a one-unit increase in the number of carburetors, the log-odds of the event is expected to increase by approximately 43.40, holding other variables constant.
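
Because these coefficients are on the log-odds scale, exponentiating them converts each one into an odds ratio, which is often easier to communicate. A minimal sketch, using only the model fitted above (the ratios here are extreme because of the separation issue):

# Convert log-odds coefficients to odds ratios
odds_ratios <- exp(coef(model))
round(odds_ratios, 3)

# Wald confidence intervals on the odds-ratio scale (extremely wide here):
# exp(confint.default(model))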

# Predict probabilities
predicted_probs <- predict(model, type = "response")

# Convert probabilities to binary predictions (0 or 1)
predictions <- ifelse(predicted_probs > 0.5, 1, 0)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Ensure 'high_efficiency' is a factor
data$high_efficiency <- factor(data$high_efficiency)

# Ensure that 'predictions' has the same factor levels as 'high_efficiency'
predictions <- factor(predictions, levels = levels(data$high_efficiency))
# Create a confusion matrix with row and column labels
conf_matrix <- table(data$high_efficiency, predictions, dnn = c("Actual", "Predicted"))

# Print the labeled confusion matrix
conf_matrix
##       Predicted
## Actual  0  1
##      0 17  0
##      1  0 15
# Extract values from the confusion matrix
true_positives <- conf_matrix[2, 2]
false_positives <- conf_matrix[1, 2]
true_negatives <- conf_matrix[1, 1]
false_negatives <- conf_matrix[2, 1]

# Calculate sensitivity and specificity
sensitivity <- true_positives / (true_positives + false_negatives)
specificity <- true_negatives / (true_negatives + false_positives)

# Print the results
cat("Sensitivity (True Positive Rate):", sensitivity, "\n")
## Sensitivity (True Positive Rate): 1
cat("Specificity (True Negative Rate):", specificity, "\n")
## Specificity (True Negative Rate): 1
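
Since caret and pROC are loaded above but not yet used, the hand-computed metrics can be cross-checked and supplemented with an ROC curve. A short sketch, assuming the predictions, predicted_probs, and data objects created above (roc_obj is an illustrative name):

# Cross-check sensitivity and specificity with caret,
# treating "1" (high efficiency) as the positive class
confusionMatrix(predictions, data$high_efficiency, positive = "1")

# ROC curve and AUC computed from the predicted probabilities
roc_obj <- roc(data$high_efficiency, predicted_probs)
auc(roc_obj)
plot(roc_obj)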

For the mtcars dataset, logistic regression is an appropriate choice when the goal is to predict a binary or categorical outcome, since it models the log-odds of that outcome as a function of the predictor variables. That said, the perfect sensitivity and specificity above are computed on the training data and reflect the separation problem noted earlier rather than genuine predictive power; with only 32 observations and ten predictors, out-of-sample validation or a smaller model would give a more honest assessment. If the objective were instead to predict a continuous outcome (e.g., mpg itself), a multiple linear regression model would be more fitting.
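
Given the separation noted above, one commonly suggested remedy is Firth's bias-reduced (penalized-likelihood) logistic regression, which yields finite estimates even when the classes are perfectly separable. A hedged sketch, assuming the logistf package is installed (firth_model and high_efficiency_num are illustrative names):

# install.packages("logistf")  # uncomment if the package is not installed
library(logistf)

# high_efficiency was converted to a factor above; recover a 0/1 numeric response
data$high_efficiency_num <- as.numeric(as.character(data$high_efficiency))

# Firth's penalized likelihood keeps coefficient estimates finite under separation
firth_model <- logistf(high_efficiency_num ~ cyl + disp + hp + drat + wt +
                         qsec + vs + am + gear + carb, data = data)
summary(firth_model)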

II.

Over the past 14 weeks, our journey through data analysis has been quite enriching. Theoretical foundations were laid with in-depth discussions of concepts like probability, Bayes' theorem, and various probability distributions. We explored practical aspects of writing functions in R, using APIs for data retrieval, and merging datasets to uncover correlations and covariances. Hypothesis testing and the Central Limit Theorem added valuable insights into statistical inference. Hands-on assignments, such as working with the Titanic and Iris datasets and diving into linear regression, gave us the skills to analyze and interpret real-world data. The course seamlessly blended theory and practical application, providing a holistic understanding of data analysis that I can confidently apply in industry settings and that will serve as a strong foundation for my future courses. Thank you.