I. LOGISTIC

Please be sure you understand and are able to implement a logistic regression for your final. We saw it in class, but please take time to digest it. Logit.html

You will find many readings online. https://seantrott.github.io/binary_classification_R/#logistic_regression_in_practice

Find the balance between the breadth and the depth.

A. Implement the logistic regression on any dataset of your choice, and interpret your coefficients.

Tell us why you should not run a multivariate regression.

A. Solution:

I used the UCBAdmissions dataset, which contains “Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.” (Source)

The research question here is: does the chance of being admitted to grad school at UCB depend on an applicant’s gender and the department they applied to?

df <- UCBAdmissions
print(df)

## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317
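
Before fitting a model, it can help to look at the raw admission rates. The following is a minimal sketch using only base R; the object names `by_gender`, `rates`, and `dept_rates` are illustrative choices of mine, not part of the assignment.

```r
# Collapse the 3-way table over departments: rows = Admit, columns = Gender
by_gender <- apply(UCBAdmissions, c(1, 2), sum)

# Column proportions: share admitted / rejected within each gender
rates <- prop.table(by_gender, margin = 2)
print(round(rates["Admitted", ], 3))

# Admission rate within each Gender x Dept cell (rows = Gender, columns = Dept)
dept_rates <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
print(round(dept_rates, 3))
```

Comparing `rates` with `dept_rates` shows whether the overall gap between men and women also appears inside each department, which is what the regression with `Dept` as a predictor tests.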

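# Outcome is TRUE for the "Admitted" cells and FALSE for the "Rejected" cells.
# Each Admit x Gender x Dept cell is one row, weighted by its count (Freq).
logit_model <- glm(Admit == "Admitted" ~ Gender + Dept,
                   data = df,
                   weights = Freq, family = "binomial")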
summary(logit_model)

## 
## Call:
## glm(formula = Admit == "Admitted" ~ Gender + Dept, family = "binomial", 
##     data = df, weights = Freq)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.58205    0.06899   8.436   <2e-16 ***
## GenderFemale  0.09987    0.08085   1.235    0.217    
## DeptB        -0.04340    0.10984  -0.395    0.693    
## DeptC        -1.26260    0.10663 -11.841   <2e-16 ***
## DeptD        -1.29461    0.10582 -12.234   <2e-16 ***
## DeptE        -1.73931    0.12611 -13.792   <2e-16 ***
## DeptF        -3.30648    0.16998 -19.452   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6044.3  on 23  degrees of freedom
## Residual deviance: 5187.5  on 17  degrees of freedom
## AIC: 5201.5
## 
## Number of Fisher Scoring iterations: 6

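The coefficients are on the log-odds (logit) scale, and the reference levels are Gender = Male and Dept = A. The intercept (0.58) is therefore the log-odds of admission for a male applicant to Department A, which corresponds to a probability of about .64 (exp(0.58) / (1 + exp(0.58))), not .58. Being female raised the log-odds of admission by 0.10, i.e., multiplied the odds by about exp(0.10) ≈ 1.11 with department held constant, but this effect was not significantly different from zero (p = .217). What was significant was department: the coefficients for Departments C through F are large and negative (p < .001), meaning that applicants of either gender had much lower odds of admission to those departments than to Department A, while Department B did not differ from A (p = .693). These are main effects of department, not effects of being female within a department; testing the latter would require a Gender:Dept interaction term. In the raw totals, men were admitted at a higher rate than women (1198 of 2691, about 45%, vs. 557 of 1835, about 30%), but the gender effect is no longer significant once department is controlled for, which suggests that the raw gap reflects which departments men and women applied to rather than a gender effect within departments.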
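
To make the log-odds coefficients easier to read, they can be converted to odds ratios and probabilities. This is a sketch that refits the same model; the explicit `as.data.frame()` call and the names `odds_ratios`, `baseline_prob`, and `new_applicants` are my own additions for illustration.

```r
# Refit the model from above on an explicit data frame
# (columns: Admit, Gender, Dept, Freq)
df <- as.data.frame(UCBAdmissions)
logit_model <- glm(Admit == "Admitted" ~ Gender + Dept,
                   data = df, weights = Freq, family = "binomial")

# Odds ratios: exponentiate the log-odds coefficients
odds_ratios <- exp(coef(logit_model))
print(round(odds_ratios, 3))

# Probability of admission at the reference levels (Male, Dept A)
baseline_prob <- plogis(coef(logit_model)[["(Intercept)"]])
print(round(baseline_prob, 3))

# Predicted probabilities for a few hypothetical applicants
new_applicants <- expand.grid(Gender = c("Male", "Female"),
                              Dept = c("A", "F"))
new_applicants$prob <- predict(logit_model, newdata = new_applicants,
                               type = "response")
print(new_applicants)
```

An odds ratio above 1 means higher odds of admission than the reference level, and one below 1 means lower odds; `plogis()` is the inverse of the logit link, so it maps any log-odds value back to a probability between 0 and 1.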

Logistic regression models are most often used to estimate or predict the probability of an event occurring given a set of independent variables; in other words, they are for when the dependent variable is categorical (binary 0 or 1 here; ordered categories such as Likert responses from 1-5 call for the ordinal extension of the model). A multivariate (linear) regression would not be appropriate for this binary outcome because it assumes a continuous dependent variable with roughly normal, constant-variance errors and models a straight-line relationship, so its predictions are not constrained to lie between 0 and 1 and cannot be read as probabilities.
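
The problem with a linear model for a binary outcome can be seen in a small simulation. This sketch uses made-up data (not the admissions data): the predictor `x`, the true slope of 2, the seed, and the object names are all assumptions chosen only for illustration.

```r
# Simulated data: binary outcome whose true probability follows a logistic curve
set.seed(42)
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

# Linear regression treats the 0/1 outcome as continuous
lm_fit <- lm(y ~ x)

# Logistic regression models the log-odds of y = 1
glm_fit <- glm(y ~ x, family = "binomial")

# Fitted values from lm() fall outside the [0, 1] range
print(range(fitted(lm_fit)))

# Fitted probabilities from glm() stay inside [0, 1]
print(range(fitted(glm_fit)))
```

The straight line keeps rising (and falling) past the ends of the probability scale, while the logistic curve flattens out as it approaches 0 and 1.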

EXTRA -

Since some of you asked, have a look at the logistic regression we are running to explain hospital bankruptcy (a real-world example). Try to read the code in 20 minutes and summarise it in one paragraph. You will see many concepts we have already covered in class, like merging datasets, correlation analysis, Box-Cox, and even creating training and testing data to test the model's predictions. https://rpubs.com/sharmaar2/Bankruptcy_Paper_Miniator_Cross

Extra response:

Hospital bankruptcies appear to have been more severe in some states, such as NJ, CT, and RI, than in others.

II. REFLECTION

Please reflect on the last 14 weeks, and maybe even skim over the material, to consolidate the topics we have seen in class.

What have you learned about data analysis, both theoretically and empirically?

Response

I found it fascinating (and a real challenge at times) to work with big datasets in this course! I especially learned about the importance of visualizing data and making meaningful interpretations of real-world data analyses. I really took a lot from the lessons on probability and Bayesian statistics. Our field has been pushing for Bayesian analysis, and now it makes a lot more sense why: we should be testing our hypotheses against what we see in the general population.

On a more personal note, I’ve never been a fan of statistics, but getting to re-explore the theoretical assumptions through simple tests and models has been helpful!

Week 7 Discussion D

Jiwon Ban

2024-04-30