1
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(mtcars)
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am             gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
mtcars %>%
group_by(vs) %>%
summarise(avg_disp = mean(disp), sd_disp = sd(disp))
## # A tibble: 2 × 3
## vs avg_disp sd_disp
## <dbl> <dbl> <dbl>
## 1 0 307. 107.
## 2 1 132. 56.9
model1 <- glm(formula = vs ~ cyl + hp + wt + disp,
family = binomial(link = "logit"),
data = mtcars)
# regression output
summary(model1)
##
## Call:
## glm(formula = vs ~ cyl + hp + wt + disp, family = binomial(link = "logit"),
## data = mtcars)
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  8.902672   5.045309   1.765   0.0776 .
## cyl         -2.083545   1.555981  -1.339   0.1806
## hp          -0.046461   0.037354  -1.244   0.2136
## wt           3.402321   2.252263   1.511   0.1309
## disp        -0.007595   0.019185  -0.396   0.6922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.860 on 31 degrees of freedom
## Residual deviance: 12.538 on 27 degrees of freedom
## AIC: 22.538
##
## Number of Fisher Scoring iterations: 7
# odds ratios: exponentiate the estimated log-odds coefficients
exp(-2.083545)  # cyl
## [1] 0.1244881
exp(-0.046461)  # hp
## [1] 0.9546018
exp(3.402321)   # wt
## [1] 30.03373
exp(-0.007595)  # disp
## [1] 0.9924338
The exponentiated coefficients are odds ratios: increasing the number of cylinders by one multiplies the odds of having a straight engine (vs = 1) by 0.1245; increasing the gross horsepower by one unit multiplies the odds by 0.9546; increasing the weight by one unit (1,000 lbs) multiplies the odds by 30.0337; and increasing the displacement by one cubic inch multiplies the odds by 0.9924, holding the other predictors fixed.
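As a compact check on these hand calculations (an added aside, not part of the original output), the same odds ratios can be pulled from the fitted model in one call, and profile-likelihood confidence intervals can be placed on the same scale:

# odds ratios for all coefficients at once
exp(coef(model1))
# profile-likelihood confidence intervals on the odds-ratio scale
exp(confint(model1))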
We should not run a multiple linear regression here because the outcome is binary: a linear regression would predict an unbounded numeric value such as 0.3256, possibly below 0 or above 1, rather than a proper probability of the outcome being 0 or 1.
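To see the problem concretely (this comparison is an added illustration, not part of the original assignment), a linear model fit to the same predictors can produce fitted values outside [0, 1], whereas the logistic model's fitted probabilities always stay inside that range:

# linear probability model on the same predictors
lm_fit <- lm(vs ~ cyl + hp + wt + disp, data = mtcars)
range(fitted(lm_fit))    # may fall below 0 or above 1
range(fitted(model1))    # fitted probabilities stay within (0, 1)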
First, load the necessary packages into the environment, then load the functions that we created. Next, load the data files, check for missing values, create a new data frame that summarizes the bankruptcies, and visualize it. Then do some exploratory data analysis: check the summary statistics of the data and the correlations among the variables. Split the dataset into training and testing sets, rebalance the data because of the highly imbalanced class distribution, and use the glmnet function to build a logistic regression model. Finally, select different sets of variables to include, build three different models, and compare them (a rough sketch of this pipeline is given below).
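A minimal sketch of that pipeline, assuming a hypothetical data frame named bankruptcy with a binary outcome column bankrupt; the object names, the 70/30 split, and the simple upsampling step are illustrative placeholders, not the actual course data or code:

library(glmnet)

set.seed(123)

# split into training and testing sets (70/30, placeholder proportion)
n         <- nrow(bankruptcy)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train     <- bankruptcy[train_idx, ]
test      <- bankruptcy[-train_idx, ]

# rebalance by upsampling the minority class (assumes bankrupt == 1 is rare)
minority  <- train[train$bankrupt == 1, ]
n_extra   <- sum(train$bankrupt == 0) - nrow(minority)
train_bal <- rbind(train,
                   minority[sample(nrow(minority), n_extra, replace = TRUE), ])

# penalized logistic regression with a cross-validated lambda
x      <- model.matrix(bankrupt ~ ., data = train_bal)[, -1]
y      <- train_bal$bankrupt
cv_fit <- cv.glmnet(x, y, family = "binomial")

# predicted probabilities on the held-out test set
x_test <- model.matrix(bankrupt ~ ., data = test)[, -1]
pred   <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "response")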
Reviewing the last 14 weeks, we can see that data analysis depends on both theoretical knowledge and empirical experience. Before the analysis, we should have a framework for conducting it, which depends mainly on our theoretical knowledge: we should consider the likely distribution of the data and find a proper model for it, which usually depends on our aim. If we want to predict a numerical variable, we need a regression model; if we want to find group structure, we need clustering methods. Theory also tells us how to interpret our model and how to test it.
However, once we start the analysis, many things go beyond theory, and we have to adjust the plan to the data. For example, if there are many missing values, we should deal with them. Sometimes the data do not meet our assumptions, such as normality, and we should either transform the data or change the target model. We might also get insignificant results, which requires us to change the parameters or double-check the processing steps.
In conclusion, to get good results, we need theory to build the analysis framework, and empirical experience helps us adjust the analysis and reach the target.