
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(mtcars)
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
mtcars %>%
  group_by(vs) %>%
  summarise(avg_disp = mean(disp), sd_disp = sd(disp))
## # A tibble: 2 × 3
##      vs avg_disp sd_disp
##   <dbl>    <dbl>   <dbl>
## 1     0     307.   107. 
## 2     1     132.    56.9
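# fit a logistic regression of engine shape (vs) on cyl, hp, wt, and disp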
model1 <- glm(formula = vs ~ cyl+hp+wt+disp, 
              family = binomial(link = "logit"),
              data = mtcars)

# regression output
summary(model1)
## 
## Call:
## glm(formula = vs ~ cyl + hp + wt + disp, family = binomial(link = "logit"), 
##     data = mtcars)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  8.902672   5.045309   1.765   0.0776 .
## cyl         -2.083545   1.555981  -1.339   0.1806  
## hp          -0.046461   0.037354  -1.244   0.2136  
## wt           3.402321   2.252263   1.511   0.1309  
## disp        -0.007595   0.019185  -0.396   0.6922  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.860  on 31  degrees of freedom
## Residual deviance: 12.538  on 27  degrees of freedom
## AIC: 22.538
## 
## Number of Fisher Scoring iterations: 7
# exponentiate each coefficient to obtain its odds ratio
exp(-2.083545)  # cyl
## [1] 0.1244881
exp(-0.046461)  # hp
## [1] 0.9546018
exp(3.402321)   # wt
## [1] 30.03373
exp(-0.007595)  # disp
## [1] 0.9924338

The exponentiated coefficients are odds ratios: increasing the number of cylinders by one unit multiplies the odds of having a straight engine by 0.1245; increasing gross horsepower by one unit multiplies the odds by 0.9546; increasing the weight by one unit (1,000 lbs) multiplies the odds by 30.0337; and increasing the displacement by one unit multiplies the odds by 0.9924.
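
Equivalently, all of the odds ratios can be computed in one step from the fitted model object:

# compute the odds ratio for every coefficient at once
exp(coef(model1))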

We should not run a multiple linear regression here, since the outcome is binary: linear regression would predict an arbitrary numeric value such as 0.3256 rather than the probability of the outcome being 0 or 1, and its fitted values are not even guaranteed to stay within [0, 1].
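
To illustrate, here is a minimal sketch of a linear probability model on the same predictors; its fitted values are not constrained to the [0, 1] interval, which is exactly the problem:

# linear probability model on the same predictors (illustration only)
lm_fit <- lm(vs ~ cyl + hp + wt + disp, data = mtcars)
# fitted values are not restricted to [0, 1]
range(fitted(lm_fit))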

First, load the necessary packages and the functions we created, then load the data files and check for missing values. Create a new data frame that summarizes the bankruptcies and visualize it. Perform some exploratory data analysis: check the summary statistics of the data and the correlations between variables. Split the dataset into training and testing sets, rebalance the training data because the class distribution is highly unbalanced, and use the glmnet function to build a logistic regression model, as sketched below. Select different variables to include in the model, build three different models, and compare them.
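
A rough sketch of the split-and-fit step, assuming a data frame named bankruptcy with a binary outcome column bankrupt (both names are hypothetical, and the rebalancing step is omitted):

library(glmnet)

set.seed(42)
# split into training and testing sets (70/30); `bankruptcy` is hypothetical
train_idx <- sample(nrow(bankruptcy), size = round(0.7 * nrow(bankruptcy)))
train <- bankruptcy[train_idx, ]
test  <- bankruptcy[-train_idx, ]  # held out for later evaluation

# glmnet takes a numeric model matrix rather than a formula
x_train <- model.matrix(bankrupt ~ ., data = train)[, -1]
y_train <- train$bankrupt

# penalized logistic regression with cross-validated lambda
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial")
coef(cv_fit, s = "lambda.min")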

  1. Reflection

Reviewing the last 14 weeks, we can see that data analysis depends on both theoretical knowledge and empirical experience. Before the analysis, we should have a framework for conducting it, which depends mainly on theoretical knowledge: we should consider the likely distribution of the data and find appropriate models for it, which usually depends on our aim. If we want to predict a numerical variable, we need a regression model; if we want to find group structure, we need clustering methods. Theory also tells us how to interpret our model and how to test it.

However, once we start the analysis, many issues arise that go beyond theory, and we should adjust the plan to fit the data. For example, if there are many missing values, we should deal with them. Sometimes the data do not meet our assumptions, such as normality, and we should transform the data or change the target model. We might also get insignificant results, which requires changing the parameters or rechecking the processing steps.

In conclusion, to get good results we need theory to build the analysis framework, while empirical experience helps us adjust the analysis and reach the target.