Want to maximize Adj. R squared \(\text{Adj. } R^2 = 1- \frac{\frac{SS{Error}}{(n-p-1}}{\frac{SS{Total}}{(n-1)}}\)
Want to minimize AIC & BIC
♥ OR ♥
♥ OR ♥
library(tidyverse)
library(knitr)
library(broom)
library(leaps)
sat_scores <- read_csv("data/sat-scores.csv")
# Forward Selection:
int_only_model <- lm(SAT ~ 1, data = sat_scores)
selected_model <- step(int_only_model, scope = formula(full_model), direction = "forward")
tidy(selected_model)
# Backward Selection:
step(full_model, scope = formula(int_only_model), direction = "backward")
reg_backward <- regsubsets(SAT ~ ., data = sat_scores, method="backward")
# Summary of selection results:
sel_summary <- summary(reg_backward)
# Choose the model with minumum BIC:
coef(reg_backward, which.min(sel_summary$bic))
#Show selected model
selected_model <- lm(SAT ~ Years + Expend + Rank, data = sat_scores)
tidy(selected_model)
coef(reg_backward, which.max(sel_summary$adjr2))
Problem: 2 variables have a perfect linear association with each other, could not find unique estimates for model coefficients * Why is this a problem? Standard errors for our regression coefficients inflate, lose precision in our estimates of the regression coefficients, impedes our ability to use th emodel fo rinference or prediction
library(tidyverse)
library(knitr)
library(broom)
tips <- read_csv("data/tip-data.csv") %>%
filter(!is.na(Party))
tips_model <- lm(Tip ~ Party + Meal + Age, data = tips)
tidy(tips_model) %>%
kable(digits = 3)
tips_aug <- augment(tips_model) %>%
mutate(obs_num = row_number()) #add row number to help with graphing
# Calculate Leverage threshold
leverage_threshold <- 2 * (5 + 1) / nrow(tips) #2(p+1)/n
leverage_threshold
# Leverage vs. Observation Number
ggplot(data = tips_aug, aes(x = obs_num, y = .hat)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = leverage_threshold, color = "red") +
labs(x = "Observation Number", y = "Leverage") +
geom_text(aes(label=ifelse(.hat > leverage_threshold,
as.character(obs_num), "")), nudge_x = 4)
# Filter for number of people above threshold
tips_aug %>%
filter(.hat > leverage_threshold) %>%
select(Party, Meal, Age)
These have higher number of people in party than rest of the data.
# scatterplot of std. residuals vs. predicted
ggplot(data = tips_aug, aes(x = .fitted, y = .std.resid)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = c(-2,2), color = "blue", lty = 2) +
geom_hline(yintercept = c(-3,3), color = "red", lty = 2) +
ylim(-4,4) +
geom_text(aes(label=ifelse(abs(.std.resid) > 2,
as.character(obs_num), "")), nudge_x = 1)
## scatterplot of Cook's D vs. observation number
tips_aug %>%
filter(abs(.std.resid) > 3)
Looking at what tips look at according to party size and bill size to show that the typical tips size is according to bill size to see if these tips are abnormally large.
ggplot(data = tips_aug, aes(x = obs_num, y = .hat)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = c(0.5, 1), color = "red") +
labs(x = "Observation Number", y = "Cook's D") +
geom_text(aes(label=ifelse(.hat > 0.5,
as.character(obs_num), "")), nudge_x = 1)
install.packages("rms")
# load rms package and calculate VIF
library(rms)
vif(tips_model) %>%
tidy()
|
Logistic Regression 2 outcomes 1: yes, 0: no |
♥ ♥ ♥ |
Multinomial Logistic Regression 3+ outcomes 1: Democrat, 2: Republican, 3: Independent |
Response: the model produces predictions between 0 & 1: probabilities
Logistic Regression Model only produces predictions between 0 & 1
|
Linear Regression Quantitative \[Y = B_0 + B_1X\] |
♥ ♥ ♥ |
Linear Reg. transform Y Quantitative \[log(Y) = B_0 + B_1X\] |
♥ ♥ ♥ |
Logistic Reg. Binary \[logodds = log(\frac{\pi}{1-\pi}) = B_0 + B_1X\] |
\[ Y = 1: yes, 0: no \\ \pi: \text{probability that Y = 1, ie., P(Y=1)} \\ \frac{\pi}{1-\pi}: \text{odds that Y = 1} \\ log(\frac{\pi}{1-\pi}): \text{log odds} \]
If p = 0.7, not ___ is 1-p = 0.3
Odds are 7 to 3, 7:3, 0.7/0.3
\[\omega = \frac{\pi}{1-\pi} \text{ & } \pi = \frac{\omega}{1-\omega} \]
\[\textbf{1. Logistic model: } \text{log odds = } log(\frac{\pi}{1-\pi}) = B_0 + B_1X \\ \textbf{2. Odds = } exp(log(\frac{\pi}{1-\pi})) = \frac{\pi}{1-\pi} \\ \textbf{3. Probability = } \pi = \frac{exp(B_0+B_1X)}{1+exp(B_0+B_1X)}\]
\[ \textbf{Logit form: } log(\frac{\pi}{1-\pi}) = B_0 + B_1X \\ \textbf{Probability Form: } \pi = \frac{exp(B_0+B_1X)}{1+exp(B_0+B_1X)} \\ \textbf{Predicted odds: } \hat{\omega}=\frac{\hat{\pi}}{1-\hat{\pi}} \\ \textbf{Predicted Probability: } \hat{\pi}=\frac{exp(logodds)}{1+exp(logodds)} \]
library(tidyverse)
library(knitr)
library(broom)
heart <- read_csv("data/framingham.csv") %>%
select(totChol, TenYearCHD) %>%
filter(!is.na(totChol)) %>% #remove observations with missing cholesterol
mutate(high_risk = as.factor(TenYearCHD))
# Probability
heart %>%
count(high_risk) %>%
mutate(prob = n/sum(n))
P(not high risk) = 0.848
# Odds
0.848/0.152
The odds that a person is not high risk of getting heart disease are about 5.58.
#glm function for logistic regression model
heart_model <- glm(high_risk ~ totChol, data = heart, family = "binomial")
tidy(heart_model)
log-odds(high-risk) = -2.894 + 0.005 * totChol
Based on the model, if a randomly selected person has a total cholesterol of 225:
# log-odds they have a high risk of heart disease
x0 <- tibble(totChol = 225)
predict(heart_model, x0)
# odds they have a high risk of heart disease
odds <- exp(-1.796071)
odds
# probability they have a high risk of heart disease
prob <- odds/(1 + odds)
prob
Based on the model, if a randomly selected person has a total cholesterol of 226
# log-odds they have a high risk of heart disease
x1 <- tibble(totChol = 226)
log_odds_x1 <- predict(heart_model, x1)
log_odds_x1
# odds they have a high risk of heart disease
odds_x1 <- exp(-1.79119)
odds_x1
# probability they have a high risk of heart disease
prob_x1 <- odds_x1 / (1 + odds_x1)
prob_x1
log-odds increase by 0.00488 (coefficient for total cholestral from logistic regression model)
totlChol in terms of the log-odds?For every one mg/dL increase in totChol the expected log-odds of being high risk are predicted to increase by 0.00488
totChol in terms of the odds?For every one mg/dL increase in totChol the expected odds are multiplied by 1.005.
Review: odds = probability of success / probability of failure
♥ OR ♥
# of successes / # of failures
\[OR = \frac{odds_1}{odds_2} = \frac{\omega_1}{\omega_2}\] "The odds of ___ are ___ times higher for those with ___ than those with ___"
Log-odds, interpret to the baseline group
"The odds of ___ are expected to be \(exp(\hat{\beta_j})\) times the odds of ___"
\[OR = exp(\hat{\beta_j})\]
“For each additional ____, the odds of ___ are expected to multiply by a factor of \(exp(\hat{\beta_j})\)”
"The odds of ___ for __ are expected to be \(exp(\hat{\beta_j})\) times the odds for ___, holding all else constant"
!(Users/courtneylee/Documents/STAT210/Quiz 3/1.png)
\[H_0:B_j=0, H_A:B_j \neq 0\]
For proortions or probabilities, use z test statistic \[z=\frac{\hat{\beta_j}-0}{SE_\hat{\beta_j}} \\ \text{p-value: } p(\lvert Z \rvert > \lvert z \rvert) \text{(z in standard normal distribution)}\]
\[B_j \pm z^*SE_\hat{\beta_j}\] Log odds!!! “For every 1 unit increase in x the LOG ODDS…”
Interpret for odds!!
\[CI: exp(B_j \pm z^*SE_\hat{\beta_j})\]
“We are C% Confident that for every one unit change in \(x_j\), the odds multiply by a factor of \(exp(B_j - z^*SE_\hat{\beta_j})\) to \(exp(B_j + z^*SE_\hat{\beta_j})\), holding all else constant.”
\[H_0: \beta_{q+1}=...=\beta_p=0 \\ H-A: \text{at least 1 } \beta_j \text{ is not 0}\] Use Drop in Deviance Test!!
\[G=(-2logL_{reduced})-(-2logL_{full}) \\ \text{p-value: } P(\chi^2>G)\] * use chi squared distribution with dof = # parameters in full - # parameters in reduced
If P value from chisq test is very small,
Conclusion: “The p-value is very small, so we reject the null hypothesis. The data provide sufficient evidence that the coefficient of ___ is not equal to 0. Therefore, we should add it to the model.” * We can also use anova function to conduct test
Use AIC or BIC, choose model with smaller AIC or BIC
library(tidyverse)
library(knitr)
library(broom)
leukemia <- read_csv("data/leukemia.csv")
Age to predict whether or not a patient responded to the treatment.leukemia <- leukemia %>%
mutate(resp = factor(Resp))
model1 <- glm(resp ~ Age, data = leukemia, family = "binomial")
tidy(model1) %>%
kable(digits = 3)
For every 1 year increase in age, the odds a patient responded to the treatment are expected to. multiply by a factor of 0.954.
We have a z test statistic. The distribution of the z test statistic is N(0,1), the standard normal distribution.
Our p-value is 0.017 < 0.05, so we reject H0. The data provides sufficient evidence that the coefficient for Age is not equal to 0. Age is a statistically significant predictor of the odds a patient responds to the treatment
full_model <- glm(resp ~ Age + Smear + Infil + Index + Blasts + Temp, data = leukemia, family = "binomial")
tidy(full_model) %>%
kable(digits = 3)
Age, index, and temp are statistically significant because they have small p-values.
Think about Age: “Statistically significant” that Age is a significant predictor of the odds a patient responds to the treatment, even after accounting for the all the other pre-treatment variables.
reduced_model <- glm(resp ~ Age + Index + Temp, data = leukemia, family = "binomial")
tidy(full_model) %>%
kable(digits = 3)
anova(reduced_model, full_model, test = "Chisq") %>%
tidy()
For my model, I would choose the one that only has the significant predictors, because the p-value of the drop-in-deviance test is large.
library(tidyverse)
library(broom)
library(knitr)
gss <- read_csv("https://sta210-fa20.netlify.app/data/gss2016.csv",
na = c("", "Don't know", "No answer",
"Not applicable"),
guess_max = 2867) %>%
select(natmass, age, sex, sei10, region, polviews) %>%
drop_na()
The goal of the analysis is to understand the factors that are associated with a person being satisfied with the current spending on mass transportation. Create a new variable that is equal to “1” if a person said spending on mass transportation is about right and “0” otherwise.
# creates new variable
gss <- gss %>%
mutate(satisfied = if_else(natmass == "About right", "1", "0"))
Recode polviews so it is a factor variable type with levels that are in an order that is consistent with question on the survey. Note how the categories are spelled in the data.
Make a plot of the distribution of polviews. Which political view occurs most frequently in this data set?# re-levels polviews
gss <- gss %>%
mutate(polviews = factor(polviews,
levels = c("Extremely liberal", "Liberal", "Slightly liberal",
"Moderate", "Slghtly conservative", "Conservative",
"Extrmly conservative")))
# plots polview distribution
ggplot(data = gss, aes(x = polviews)) +
geom_bar() +
labs(x = "Political View",
y = "Count",
title = "Distribution of Political Views")
Moderate views occur most frequently in this dataset.
Make a plot displaying the relationship between satisfaction with mass transportation spending and political views. Use the plot to describe the relationship between a person’s political views and whether they are satisfied with spending on mass transportation.
# plots distribution of polviews vs satisfied
ggplot(data = gss, aes(x = polviews, fill = satisfied)) +
geom_bar(position = "fill") +
labs(x = "Political View",
y = "Satisfaction with Transportation Spending",
title = "Satisfaction with Spending vs. Political Views")
We’d like to use age as a quantitative variable in your model; however, it is currently a character data type because some observations are coded as “89 or older”. Recode age so that is a numeric variable. Note: Before making the variable numeric, you will need to replace the values “89 or older” with a single value.
# makes age numeric
gss <- gss %>%
mutate(age = if_else(age == "89 or older", '89', age)) %>%
mutate(age = as.numeric(age))
Briefly explain why we should use a logistic regression model to predict the odds a randomly selected person is satisfied with spending on mass transportation.
We should use a logistic regression model because our response variable is binary and we need a logistic regression model to only produce predictions between 0 and 1 as we want to predict probability.
Let’s start by fitting a model using the demographic factors - age, sex, sei10, and region - to predict the odds a person is satisfied with spending on mass transportation. Make any necessary adjustments to the variables so the intercept will have a meaningful interpretation.
# mean-centers variables
gss <- gss %>%
mutate(satisfied = as.numeric(satisfied)) %>%
mutate(age = age - mean(age)) %>%
mutate(sei10 = sei10 - mean(sei10))
# creates model
gss_model <- glm(satisfied ~ age + sex + sei10 + region, data = gss, family = "binomial")
tidy(gss_model) %>%
kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.346 | 0.103 | 3.348 | 0.001 |
| age | -0.006 | 0.002 | -2.725 | 0.006 |
| sexMale | -0.261 | 0.080 | -3.247 | 0.001 |
| sei10 | -0.006 | 0.002 | -3.561 | 0.000 |
| regionE. sou. central | -0.168 | 0.180 | -0.936 | 0.349 |
| regionMiddle atlantic | 0.061 | 0.154 | 0.395 | 0.693 |
| regionMountain | -0.137 | 0.167 | -0.824 | 0.410 |
| regionNew england | -0.652 | 0.188 | -3.465 | 0.001 |
| regionPacific | -0.384 | 0.142 | -2.708 | 0.007 |
| regionSouth atlantic | -0.048 | 0.131 | -0.367 | 0.713 |
| regionW. nor. central | -0.055 | 0.181 | -0.304 | 0.761 |
| regionW. sou. central | 0.181 | 0.159 | 1.142 | 0.253 |
Interpret the intercept in the context of the data. Include any relevant values in your response.
The odds that a person said that the spending on mass transportation is about right is 1.413 (exp(0.346)) if the person has the mean age, is female, has the mean socioeconomic status, and is interviewed in the East North Central region.
Consider the relationship between age and one’s opinion about spending on mass transportation. Interpret the coefficient of age in terms of the odds of being satisfied with spending on mass transportation.
For every one year increase in age, odds of being satisfied on spending on mass transportation are expected to multiply by a factor of exp(-0.006) = 0.994, holding all else constant.
Now let’s see whether a person’s political views has a significant impact on their odds of being satisfied with spending on mass transportation, after accounting for the demographic factors.
Conduct a drop-in-deviance test to determine if polviews is a significant predictor of attitude towards spending on mass transportation. State the null and alternative hypotheses in words, display all relevant code and output, and state your conclusion in the context of the problem.
# creates model
gss_pol <- glm(satisfied ~ age + sex + sei10 +
region + polviews, data = gss, family = "binomial")
tidy(gss_pol, conf.int = TRUE) %>%
kable(digits = 3)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -0.423 | 0.208 | -2.035 | 0.042 | -0.837 | -0.020 |
| age | -0.008 | 0.002 | -3.349 | 0.001 | -0.012 | -0.003 |
| sexMale | -0.277 | 0.082 | -3.392 | 0.001 | -0.437 | -0.117 |
| sei10 | -0.005 | 0.002 | -2.739 | 0.006 | -0.008 | -0.001 |
| regionE. sou. central | -0.204 | 0.182 | -1.120 | 0.263 | -0.560 | 0.153 |
| regionMiddle atlantic | 0.099 | 0.156 | 0.633 | 0.527 | -0.206 | 0.405 |
| regionMountain | -0.097 | 0.169 | -0.575 | 0.565 | -0.427 | 0.234 |
| regionNew england | -0.562 | 0.191 | -2.943 | 0.003 | -0.940 | -0.190 |
| regionPacific | -0.357 | 0.144 | -2.483 | 0.013 | -0.639 | -0.076 |
| regionSouth atlantic | -0.050 | 0.133 | -0.379 | 0.705 | -0.310 | 0.210 |
| regionW. nor. central | -0.062 | 0.183 | -0.339 | 0.735 | -0.421 | 0.298 |
| regionW. sou. central | 0.147 | 0.161 | 0.918 | 0.359 | -0.167 | 0.464 |
| polviewsLiberal | 0.248 | 0.217 | 1.144 | 0.253 | -0.174 | 0.678 |
| polviewsSlightly liberal | 0.606 | 0.220 | 2.754 | 0.006 | 0.178 | 1.042 |
| polviewsModerate | 0.926 | 0.196 | 4.716 | 0.000 | 0.546 | 1.317 |
| polviewsSlghtly conservative | 0.845 | 0.214 | 3.949 | 0.000 | 0.430 | 1.270 |
| polviewsConservative | 0.967 | 0.212 | 4.552 | 0.000 | 0.555 | 1.390 |
| polviewsExtrmly conservative | 1.314 | 0.281 | 4.675 | 0.000 | 0.770 | 1.873 |
# drop-in-deviance test
anovamodel <- anova(gss_model, gss_pol, test = "Chisq")
tidy(anovamodel) %>%
kable(digits = 3)
| Resid..Df | Resid..Dev | df | Deviance | p.value |
|---|---|---|---|---|
| 2578 | 3518.199 | NA | NA | NA |
| 2572 | 3460.786 | 6 | 57.413 | 0 |
Null Hypothesis: The coefficients associated with political views are all zero.
Alternative Hypothesis: At least one of the coefficients associated with political views is not zero.
Conclusion: The p-value is near zero, so there is sufficient evidence to reject the null hypothesis. The data provides sufficient evidence that at least one of the coefficients associated with political views is not zero, and political views is a significant predictor of the attitude towards spending on mass transportation.
Use the model to describe the relationship between one’s political views and their attitude towards spending on mass transportation. Be sure your answer includes the interpretation of the model coefficients and associated hypothesis tests or confidence intervals used to support your response.
* The odds of being satisfied on spending on mass transportation for liberal views are expected to be exp(0.248) = 1.281 times the odds for extremely liberal views, holding all else constant. The p-value is greater than 0.05, which means that there is not sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is liberal.
The odds of being satisfied on spending on mass transportation for slightly liberal views are expected to be exp(0.606) = 1.833 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is slightly liberal.
The odds of being satisfied on spending on mass transportation for moderate views are expected to be exp(0.926) = 2.524 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is moderate.
The odds of being satisfied on spending on mass transportation for slightly conservative views are expected to be exp(0.845) = 2.328 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is slightly conservative.
The odds of being satisfied on spending on mass transportation for conservative views are expected to be exp(0.967) = 2.630 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is conservative.
The odds of being satisfied on spending on mass transportation for extremely conservative views are expected to be exp(1.314) = 3.721 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is extremely conservative.
library(tidyverse)
library(broom)
library(knitr)
medGPA <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 3/MedGPA.csv")
birth <- matrix(c(9584, 4792, 14376, 2.4, 1.1, 2), ncol=3, byrow=TRUE)
colnames(birth) <- c("With Defects","Without Defects", "Total")
rownames(birth) <- c("# babies","% mothers used IVF")
birth <- as.table(birth)
birth
## With Defects Without Defects Total
## # babies 9584.0 4792.0 14376.0
## % mothers used IVF 2.4 1.1 2.0
IVF <- matrix(c(230, 9354, 9584, 53, 4739, 4792, 283, 14093, 14376), ncol = 3, byrow=TRUE)
colnames(IVF) <- c("Used IVF", "Did not use IVF", "Total")
rownames(IVF) <- c("# of babies with defects", "# of babies without defects", "Total")
IVF <- as.table(IVF)
IVF
## Used IVF Did not use IVF Total
## # of babies with defects 230 9354 9584
## # of babies without defects 53 4739 4792
## Total 283 14093 14376
Odds of a birth defect = Total # babies with defects/Total # babies without defects
# Odds of a birth defect
9584/4792
## [1] 2
The odds that a baby in this study was born with a birth defect is 2.
# Odds of having baby with defect
.03/.97
## [1] 0.03092784
The odds of having a baby with a birth defect is 0.031 in reality
# Ratio of odds of birth defect in the study and odds of birth defect in reality
2/0.03092784
## [1] 64.66666
The odds of a baby having a birth defect in the study is 64.7 times higher than the odds of a baby having a birth defect in reality.
Find the odd ratio of the odds of a baby having a birth defect from mothers who used IVF to the odds of a baby having a birth defect from mothers who did not use an IVF.
Odds of a birth defect from mothers who use IVF = 230/53 = 4.34
Odds of a birth defect from mothers who did not use IVF = 9354/4739 = 1.97
Odds ratio = 4.34/1.97 = 2.20
The odds of a baby with a birth defect is 2.20 times higher for a mother using IVF that a mother not using an IVF. However, we can not conclude that this data supports the finding’s conclusion because the study has a disproportionately large number of babies who had defects. As seen in the odds ratio in the beginning of question 3, the odds of having a baby with a birth defect in the study is much greater than the odds in reality. Based on the data from the findings, I would conclude that the IVF DOES carry excessive risks, which does not support the researcher’s conclusion.
acceptance <- glm(Acceptance ~ MCAT + GPA, data = medGPA, family = "binomial")
tidy(acceptance)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -22.4 6.45 -3.47 0.000527
## 2 MCAT 0.165 0.103 1.59 0.111
## 3 GPA 4.68 1.64 2.85 0.00439
log-odds(acceptance) = -22.37 + 0.16 * MCAT + 4.68 * GPA
Increasing GPA by 1 point increases the odds of being accepted into medical school much more than increasing the MCAT score by one point. However, 1 whole GPA score increase might be much more difficult to achieve than 1 point increase in the MCAT score. To make this interpretation more useful, we can state that for every 0.01 point increase in GPA, the odds of acceptance into medical school are expected to multiply by 1.048.
However, because the GPA p-value is significant and the MCAT p-value is not significant, we can infer that improving GPA will have a bigger effect on the odds of acceptance relative to the MCAT score.
acceptance2 <- glm(Acceptance ~ MCAT + GPA + Apps, data = medGPA, family = "binomial")
tidy(acceptance2, conf.int = TRUE) %>%
kable(digits = 3)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -23.689 | 7.024 | -3.373 | 0.001 | -39.590 | -11.662 |
| MCAT | 0.173 | 0.105 | 1.641 | 0.101 | -0.021 | 0.398 |
| GPA | 4.861 | 1.694 | 2.869 | 0.004 | 1.839 | 8.576 |
| Apps | 0.044 | 0.076 | 0.575 | 0.565 | -0.106 | 0.198 |
\(H_0: \beta_{apps} = 0 \\ H_A: \beta_{apps} \neq 0\)
The null hypothesis states that the number of applications is not a predictor for odds of acceptance, and the alternative hypothesis states that the number of applications is not a predictor for odds of acceptance The p-value of \(\beta_{apps}\) is 0.565 > 0.05. Therefore we fail to reject the null hypothesis, and there is not sufficient evidence from the data that the number of applications predict acceptance into medical school after accounting for MCAT score and GPA.
\(H_0: \beta_{q+1} = ... = \beta_p = 0 \\ H_A: \text{at least one } \beta_{j} \neq 0\)
The null hypothesis states that the coefficients in the full model as well as the interactions between sex and MCAT and sex and GPA are not significant additions to the model. The alternative hypothesis states that at least one of the coefficients in the full model is a significant predictor addition to the model.
model_reduced <- glm(Acceptance ~ MCAT + GPA, data = medGPA, family = "binomial")
model_full <- glm(Acceptance ~ MCAT + GPA + Sex + Sex * MCAT + Sex * GPA,
data = medGPA, family = "binomial")
tidy(model_full) %>%
kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -24.424 | 10.037 | -2.433 | 0.015 |
| MCAT | 0.024 | 0.147 | 0.160 | 0.873 |
| GPA | 6.857 | 3.185 | 2.153 | 0.031 |
| SexM | -6.155 | 16.407 | -0.375 | 0.708 |
| MCAT:SexM | 0.326 | 0.237 | 1.376 | 0.169 |
| GPA:SexM | -1.948 | 4.162 | -0.468 | 0.640 |
anova(model_reduced, model_full, test = "Chisq") %>%
tidy()
## # A tibble: 2 x 5
## Resid..Df Resid..Dev df Deviance p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 52 54.0 NA NA NA
## 2 49 48.6 3 5.39 0.146
The p-value is 0.146 > 0.05. Therefore, we reject the null hypothesis and conclude that there is not enough evidence from the data to conclude that after accounting for GPA and MCAT score, sex and interactions between sex and GPA and MCAT are not significant additions as predictors for odds of medical school acceptance. According to this drop in deviance model, the reduced model which just includes GPA and MCAT is the best model. There is also no evidence that the effect of MCAT score or GPA differs for males and females since these are not significant predictors in the model.