Please be sure you understand and are able to implement a logistic regression for your final. We saw it in class, but please take time to digest it (see Logit.html).
You will find many readings online. https://seantrott.github.io/binary_classification_R/#logistic_regression_in_practice
Find the balance between the breadth and the depth.
Tell us why you should not run a multivariate regression.
Look at the dataset UCBAdmissions, which is “Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.” (Source)
The research question here is: does the chance of being admitted to grad school at UCB depend on one’s gender and the department one applies to?
df <- UCBAdmissions
print(df)
## , , Dept = A
##
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
##
## , , Dept = B
##
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
##
## , , Dept = C
##
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
##
## , , Dept = D
##
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
##
## , , Dept = E
##
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
##
## , , Dept = F
##
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317
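Before modelling, the research question can be explored descriptively. The sketch below is an illustration, not part of the original analysis: it converts the table to a data frame (the name adm is an assumption made here) and compares admission rates by gender, first overall and then within each department.

```r
# UCBAdmissions ships with base R (datasets package) as a 3-way table.
adm <- as.data.frame(UCBAdmissions)  # columns: Admit, Gender, Dept, Freq

# Overall admission rate by gender, ignoring department.
by_gender <- xtabs(Freq ~ Gender + Admit, data = adm)
overall_rate <- prop.table(by_gender, margin = 1)[, "Admitted"]
print(round(overall_rate, 3))

# Admission rate by gender within each department.
by_dept <- xtabs(Freq ~ Dept + Gender + Admit, data = adm)
dept_rate <- prop.table(by_dept, margin = c(1, 2))[, , "Admitted"]
print(round(dept_rate, 3))
```

Comparing the two printouts shows whether a gender gap in the overall rates is still present once applicants are compared within the same department.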
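# glm() coerces the table to a data frame with columns Admit, Gender, Dept, Freq;
# Freq (the cell count) is used as the weight for each of the 24 rows.
logit_model <- glm(Admit == "Admitted" ~ Gender + Dept,
                   data = df,
                   weights = Freq, family = "binomial")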
summary(logit_model)
##
## Call:
## glm(formula = Admit == "Admitted" ~ Gender + Dept, family = "binomial",
## data = df, weights = Freq)
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.58205    0.06899   8.436   <2e-16 ***
## GenderFemale  0.09987    0.08085   1.235    0.217
## DeptB        -0.04340    0.10984  -0.395    0.693
## DeptC        -1.26260    0.10663 -11.841   <2e-16 ***
## DeptD        -1.29461    0.10582 -12.234   <2e-16 ***
## DeptE        -1.73931    0.12611 -13.792   <2e-16 ***
## DeptF        -3.30648    0.16998 -19.452   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 23 degrees of freedom
## Residual deviance: 5187.5 on 17 degrees of freedom
## AIC: 5201.5
##
## Number of Fisher Scoring iterations: 6
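The coefficients of the logistic regression are on the log-odds scale. The intercept (0.58) is the log-odds of admission for the reference group, a male applicant to Dept A, which corresponds to a probability of about .64. Holding department constant, being female is associated with a 0.10 increase in the log-odds of admission (an odds ratio of about 1.11), but this effect was not significantly different from zero (p = .217). What was significant, however, was the department: relative to Dept A, the coefficients for Depts C-F were negative, i.e., applicants to those departments had lower odds of being admitted regardless of gender, while Dept B did not differ significantly from Dept A (p = .693). In the raw totals, about twice as many males as females were admitted (1198 vs. 557), and the overall admission rate was also higher for males (about 45% of 2691 male applicants vs. about 30% of 1835 female applicants). Once department is controlled for, that gap disappears: about half of the male applicants applied to Depts A and B, which had the highest admission rates, whereas most female applicants applied to the more selective departments.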
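Because glm() reports binomial coefficients as log-odds, it can help to convert them before interpreting them. The following is a minimal sketch, assuming the same model as above refit on a data frame version of the table; the names adm, fit, odds_ratios, p_male_A and p_female_A are introduced here for illustration only.

```r
# Refit the same model on the data-frame form of the table.
adm <- as.data.frame(UCBAdmissions)  # columns: Admit, Gender, Dept, Freq
fit <- glm(Admit == "Admitted" ~ Gender + Dept,
           data = adm, weights = Freq, family = binomial)

# exp() turns log-odds coefficients into odds ratios
# (relative to the reference levels: Male, Dept A).
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))

# plogis() is the inverse logit: it maps log-odds to probabilities.
b <- coef(fit)
p_male_A   <- plogis(b[["(Intercept)"]])
p_female_A <- plogis(b[["(Intercept)"]] + b[["GenderFemale"]])
print(round(c(male_A = p_male_A, female_A = p_female_A), 3))
```

An odds ratio below 1 for a department means lower odds of admission than in Dept A; an odds ratio close to 1 for GenderFemale is consistent with the non-significant gender coefficient.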
Logistic regression models are most often used to estimate the probability of an event occurring given a set of independent variables. In other words, logistic regression is for categorical dependent variables: binary logistic regression for a 0/1 outcome such as admitted vs. rejected, with ordinal or multinomial extensions for outcomes such as Likert responses ranging from 1 to 5. A multivariate (linear) regression would not be appropriate for this binary outcome because it assumes a continuous dependent variable and a linear relationship with the predictors; applied to a 0/1 outcome, it can produce predicted probabilities below 0 or above 1, and its errors cannot be normally distributed with constant variance.
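The difference can be seen in a small simulation. This is a sketch on made-up data, not on the admissions data: the predictor x, the outcome y, the seed, and the true slope of 2 are all assumptions chosen for illustration.

```r
set.seed(1)

# Simulated data: one continuous predictor and a binary outcome whose
# true probability follows a logistic curve.
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

# Linear regression treats the 0/1 outcome as continuous.
lin_fit <- lm(y ~ x)
lin_pred <- predict(lin_fit)

# Logistic regression models the log-odds and returns probabilities.
logit_fit <- glm(y ~ x, family = binomial)
logit_pred <- predict(logit_fit, type = "response")

# Compare the range of the fitted values from each model.
print(range(lin_pred))
print(range(logit_pred))
```

The fitted values of the linear model are not constrained to the interval from 0 to 1, whereas the logistic model's predicted probabilities always are.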
Since some of you asked, have a look at the logistic regression we are running to explain hospital bankruptcy (a real-world example). Try to read the code in 20 minutes and summarise it in one paragraph. You will see many concepts we have already covered in class, like merging datasets, correlation analysis, the Box-Cox transformation, and even creating training and testing datasets to test the model's predictions. https://rpubs.com/sharmaar2/Bankruptcy_Paper_Miniator_Cross
Hospital bankruptcies appear to have been more frequent in some states than in others, such as NJ, CT, and RI.
Please reflect on the last 14 weeks, and maybe even skim the material we have covered, to consolidate the topics we have seen in class.
What have you learned about data analysis, both theoretically and empirically?
I found it fascinating (and a real challenge at times) to work with big datasets in this course! I especially learned about the importance of visualizing real-world data and interpreting the analyses meaningfully. I also took a lot from the lessons on probability and Bayesian statistics. Our field has been pushing for Bayesian analysis, and now it makes a lot more sense why: we should be testing our hypotheses against what we see in the general population.
On a more personal note, I’ve never been a fan of statistics, but getting to re-explore the theoretical assumptions through simple tests and models has been helpful!