Haiding Luo
2023-12-18
1. Implement the logistic regression on any dataset of your choice, and interpret your coefficients. Tell us why you should not run a multivariate regression.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
data(mtcars)
mtcars$high_mpg <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)
model <- glm(high_mpg ~ wt + hp + qsec + am, family = binomial, data = mtcars)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = high_mpg ~ wt + hp + qsec + am, family = binomial,
## data = mtcars)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.500e+02 1.097e+06 0.001 1.000
## wt -1.856e+02 1.534e+05 -0.001 0.999
## hp 6.864e-02 2.168e+03 0.000 1.000
## qsec 3.098e+00 4.619e+04 0.000 1.000
## am -5.339e+01 1.357e+05 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.4236e+01 on 31 degrees of freedom
## Residual deviance: 3.3270e-09 on 27 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
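The two warnings and the summary above (very large standard errors, a residual deviance near zero, and 25 Fisher scoring iterations) are the usual signs that the predictors separate the two classes completely. A minimal sketch of how this could be checked is shown below; it refits the same model with the warnings silenced, and the cross-tabulation is an added diagnostic, not part of the original analysis.

```r
data(mtcars)
mtcars$high_mpg <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)

# Refit the same model; suppressWarnings() only hides the two glm.fit warnings
model <- suppressWarnings(
  glm(high_mpg ~ wt + hp + qsec + am, family = binomial, data = mtcars)
)

# FALSE when Fisher scoring stopped at the iteration limit
model$converged

# Fitted probabilities are pushed to (numerically) 0 and 1
range(fitted(model))

# Observed classes against rounded fitted probabilities:
# no off-diagonal counts means every car is classified perfectly
table(observed = mtcars$high_mpg, fitted = round(fitted(model)))
```

When every observation is classified perfectly, the likelihood keeps increasing as the coefficients grow, so the reported estimates are simply where the algorithm stopped.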
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = hp)) +
  # Map color and size on the points only, so geom_smooth() does not inherit them
  geom_point(aes(color = mpg, size = disp)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Horsepower vs Weight",
       x = "Weight (1000 lbs)",
       y = "Horsepower",
       color = "Miles per gallon (mpg)",
       size = "Displacement (cu. inches)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'
exp(coef(model))
## (Intercept) wt hp qsec am
## 7.608045e+238 2.417769e-81 1.071046e+00 2.214412e+01 6.473498e-24
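The values above are exponentiated coefficients, so they are odds ratios, not changes in the log-odds. The intercept, 7.608045e+238, is the estimated odds of high_mpg = 1 when every predictor equals zero, a point far outside the data with no practical meaning. For wt, 2.417769e-81 means that each additional unit of weight (1000 lbs) multiplies the odds of high mpg by that factor, holding the other predictors fixed, which is a drastic decrease. For hp, 1.071046e+00 means that each additional unit of horsepower multiplies the odds by about 1.07, a 7% increase. For qsec, 2.214412e+01 means that each additional second of quarter-mile time multiplies the odds by about 22. For am, 6.473498e-24 is the odds ratio for manual (am = 1) versus automatic (am = 0) transmission. These magnitudes should not be taken at face value: glm() warned that the algorithm did not converge and that fitted probabilities of 0 or 1 occurred, the standard errors range from about 1e+03 to 1e+06, every p-value is essentially 1, and the residual deviance is essentially zero. This pattern indicates that the predictors separate the two classes perfectly, so the maximum-likelihood estimates are not finite and the reported numbers are only where the algorithm stopped after 25 iterations.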
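Because the four-predictor model did not converge, it may help to see odds ratios from a fit that does. The sketch below keeps the same high_mpg outcome but uses hp as the only predictor. The one-predictor formula and the names fit_hp and or_table are assumptions made for illustration, and the Wald intervals from confint.default() are one of several possible interval choices.

```r
data(mtcars)
mtcars$high_mpg <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)

# Simplified model (assumption): horsepower as the only predictor
fit_hp <- glm(high_mpg ~ hp, family = binomial, data = mtcars)

# TRUE when Fisher scoring converged
fit_hp$converged

# Odds ratios with 95% Wald confidence intervals
or_table <- exp(cbind(odds_ratio = coef(fit_hp), confint.default(fit_hp)))
or_table
```

An odds ratio below 1 in the hp row would mean that each additional unit of horsepower lowers the odds of a car being in the high-mpg group.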
Why not run a multivariate regression?
Multivariate linear regression assumes that the dependent variable is continuous and is a linear function of the independent variables. Those assumptions do not hold for a binary outcome such as high_mpg: the response takes only the values 0 and 1, and a fitted straight line can produce predictions below 0 or above 1, which cannot be read as probabilities. Logistic regression is designed for this situation. It does not predict a numerical value of the outcome but the probability that the event occurs, by modeling the log-odds (the logit of that probability) as a linear function of the predictors. This keeps every predicted probability between 0 and 1 and handles the categorical nature of the dependent variable.
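The difference can be illustrated with a small sketch that fits a linear and a logistic model to the same binary outcome and compares their predictions. The single predictor hp, the names lin_fit, log_fit, and new_cars, and the three horsepower values are hypothetical choices made for illustration.

```r
data(mtcars)
mtcars$high_mpg <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)

# Same 0/1 outcome, two model types (hp as the only predictor is an assumption)
lin_fit <- lm(high_mpg ~ hp, data = mtcars)
log_fit <- glm(high_mpg ~ hp, family = binomial, data = mtcars)

# Hypothetical cars with low, medium, and high horsepower
new_cars <- data.frame(hp = c(50, 150, 300))

comparison <- cbind(
  new_cars,
  linear   = predict(lin_fit, newdata = new_cars),
  logistic = predict(log_fit, newdata = new_cars, type = "response")
)
comparison
```

The linear column is not constrained to the unit interval and goes negative for the 300 hp car, while the logistic column always stays strictly between 0 and 1.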
2. Reflection
Reflecting on the past 14 weeks of data analysis coursework, I am amazed by the breadth and depth of the topics we have covered. The journey began with the basics of R and with the different types of variables and their levels of measurement, which laid the foundation for the statistical programming skills that any data analyst needs. Over the semester, I feel that I have not only grasped the theoretical principles of data analysis but also acquired practical skills in applying them to real-world scenarios using R. This course has been a comprehensive introduction to the world of data analysis, and I am eager to keep building on it.