Discussion 15

Haiding Luo

2023-12-18

1. Implement logistic regression on a dataset of your choice and interpret the coefficients. Explain why you should not run a multivariate regression.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

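data(mtcars)
# Binary outcome: 1 if mpg is above the sample median, 0 otherwise
mtcars$high_mpg <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)
# Logistic regression of high_mpg on weight, horsepower, quarter-mile time, and transmission
model <- glm(high_mpg ~ wt + hp + qsec + am, family = binomial, data = mtcars)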

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model)

## 
## Call:
## glm(formula = high_mpg ~ wt + hp + qsec + am, family = binomial, 
##     data = mtcars)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  5.500e+02  1.097e+06   0.001    1.000
## wt          -1.856e+02  1.534e+05  -0.001    0.999
## hp           6.864e-02  2.168e+03   0.000    1.000
## qsec         3.098e+00  4.619e+04   0.000    1.000
## am          -5.339e+01  1.357e+05   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.4236e+01  on 31  degrees of freedom
## Residual deviance: 3.3270e-09  on 27  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25
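
The warnings raised during fitting, the near-zero residual deviance, and the very large standard errors are the usual symptoms of perfect separation. The following is a minimal sketch of how that could be checked. The object name `sep_fit` and the tolerance `eps` are illustrative assumptions, not part of the original analysis.

```r
# Refit the same model under a new, hypothetical name so `model` is untouched.
data(mtcars)
mtcars$high_mpg <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)
sep_fit <- suppressWarnings(
  glm(high_mpg ~ wt + hp + qsec + am, family = binomial, data = mtcars)
)

# Did the fitting algorithm report convergence?
sep_fit$converged

# Share of observations whose fitted probability is within eps of 0 or 1.
eps <- 1e-6  # illustrative tolerance (an assumption)
p_hat <- fitted(sep_fit)
share_extreme <- mean(p_hat < eps | p_hat > 1 - eps)
share_extreme

# Ratio of each standard error to the size of its estimate.
coef_table <- summary(sep_fit)$coefficients
coef_table[, "Std. Error"] / abs(coef_table[, "Estimate"])
```

A share close to 1, together with a fit that did not converge, is consistent with the predictors splitting the two classes perfectly. Refitting with fewer predictors is one way to see whether the problem goes away.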

library(ggplot2)
# Map color and size only in geom_point() so geom_smooth() does not inherit them
ggplot(mtcars, aes(x = wt, y = hp)) +
  geom_point(aes(color = mpg, size = disp)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Horsepower vs Weight",
       x = "Weight (1000 lbs)",
       y = "Horsepower",
       color = "Miles per gallon (mpg)",
       size = "Displacement (cu. inches)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

## `geom_smooth()` using formula = 'y ~ x'

exp(coef(model))

##   (Intercept)            wt            hp          qsec            am 
## 7.608045e+238  2.417769e-81  1.071046e+00  2.214412e+01  6.473498e-24

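The values above are exponentiated coefficients, so they are odds ratios rather than changes in log-odds. The intercept, 7.608045e+238, is the estimated odds of high mpg when every predictor equals zero, which lies far outside the data and has no practical meaning. For wt, 2.417769e-81 means that each additional unit of weight (1,000 lbs) multiplies the odds of high mpg by that factor, holding the other predictors constant, which drives the odds to essentially zero. For hp, 1.071046 means that each additional unit of horsepower multiplies the odds by about 1.07, an increase of roughly 7%. For qsec, 22.14412 means that each additional second of quarter-mile time multiplies the odds by about 22. For am, 6.473498e-24 is the odds ratio for a manual transmission (am = 1) relative to an automatic (am = 0).

These estimates should not be taken at face value. glm() warned that the algorithm did not converge and that fitted probabilities were numerically 0 or 1, the residual deviance is essentially zero, and every standard error is several orders of magnitude larger than its estimate, so all p-values are about 1. This pattern indicates that the predictors separate the two classes perfectly in these 32 observations, so the magnitudes of the coefficients are not reliable.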
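
As a worked example of the arithmetic behind `exp(coef(model))`, the sketch below converts the hp coefficient printed by `summary(model)` into an odds ratio and a percent change in the odds. The variable names are illustrative, and the coefficient is copied by hand from the rounded output, so the result only approximates the value printed above.

```r
# Log-odds coefficient for hp, copied from the summary(model) table (rounded).
b_hp <- 6.864e-02

# Exponentiating a log-odds coefficient gives an odds ratio.
or_hp <- exp(b_hp)

# An odds ratio of r corresponds to a (r - 1) * 100 percent change in the odds.
pct_hp <- (or_hp - 1) * 100

round(c(odds_ratio = or_hp, percent_change = pct_hp), 3)
```

The same conversion applies to every coefficient: a negative log-odds estimate gives an odds ratio below 1, and a positive one gives an odds ratio above 1.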

Why not run a multivariate regression?

Multivariate regression, understood here as ordinary linear regression with several predictors, assumes that the dependent variable is continuous and changes linearly with the independent variables. That assumption does not hold for a binary outcome such as high_mpg: a linear fit can produce predicted values below 0 or above 1, which cannot be read as probabilities. Logistic regression is designed for this situation. It does not predict the outcome value itself. Instead, it models the log-odds of the event as a linear function of the predictors through the logit link, so the implied probability always lies between 0 and 1 and the categorical nature of the dependent variable is respected.
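
The difference can be illustrated with a short sketch. It fits an ordinary linear model to the binary outcome and compares the range of its fitted values with the range of the inverse logit function. Using wt as the only predictor and the names `lin_fit`, `lin_range`, `eta`, and `p` are assumptions made for this illustration.

```r
data(mtcars)
mtcars$high_mpg <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)

# Linear regression on the 0/1 outcome: fitted values are not bounded.
lin_fit <- lm(high_mpg ~ wt, data = mtcars)
lin_range <- range(fitted(lin_fit))
lin_range

# The inverse logit, plogis(), maps any linear predictor into (0, 1).
eta <- seq(-10, 10, by = 5)
p <- plogis(eta)
round(p, 5)
```

If the linear fit returns values outside the interval from 0 to 1, those values cannot be interpreted as probabilities, whereas the inverse logit keeps every value strictly inside that interval.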

2. Reflection

Reflecting on the past 14 weeks of data analysis coursework, I am amazed by the breadth and depth of topics that we have covered. The journey began with the basics of R and an understanding of different types of variables and their measurements, which laid a solid foundation for the essential statistical programming skills for any data analyst. Through the semester, I feel that I have not only mastered the theoretical knowledge of data analysis principles but also acquired practical skills in applying these principles in real-world scenarios using R. This course has been a comprehensive introduction to the world of data analysis, and I am eager to continue learning on this solid foundation.