Model Selection

Notes

Explanation

  • Goal: Identify variables that are important in explaining variation in the response
  • Interpret any variables of interest
  • Include all variables you think are related to the response even if they are not statistically significant
  • Interpret coefficients with caution, especially if there are problems with multicolinearity in the model

Model Selection Criteria

Want to maximize Adj. R squared \(\text{Adj. } R^2 = 1- \frac{\frac{SS{Error}}{(n-p-1}}{\frac{SS{Total}}{(n-1)}}\)

Want to minimize AIC & BIC

Selection Process: Backward Selection

  • Start with model that includes all variables of interest, dropping variables one at a time that are deemed irrelevant based on some criterion.
  • Common Criterion:
  • Drop variable that results in model with highest Adj. R squared

♥ OR ♥

  • Drop variable that results in model with lowest value of AIC or BIC
  • Stop when no more variables can be removed

Selection Process: Backward Selection

  • Start with intercept only model (no prediction)
  • Include variables one at a time based on common criterion:
  • Add variables for model until you get the model with highest Adj. R squared

♥ OR ♥

  • Add variables for model until you get the model with lowest AIC or BIC

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)
library(leaps)

sat_scores <- read_csv("data/sat-scores.csv")

Step Function

# Forward Selection:
int_only_model <- lm(SAT ~ 1, data = sat_scores)
selected_model <- step(int_only_model, scope = formula(full_model), direction = "forward")
tidy(selected_model)
# Backward Selection:
step(full_model, scope = formula(int_only_model), direction = "backward")

Regsubsets Function

reg_backward <- regsubsets(SAT ~ ., data = sat_scores, method="backward")
# Summary of selection results:
sel_summary <- summary(reg_backward)

# Choose the model with minumum BIC:
coef(reg_backward, which.min(sel_summary$bic))

#Show selected model
selected_model <- lm(SAT ~ Years + Expend + Rank, data = sat_scores)
tidy(selected_model)

Regsubsets with Adj. R Squared

coef(reg_backward, which.max(sel_summary$adjr2))

Model Diagnostics

Notes

Influential Point

  • If influential, removing it substantially changes the coefficients of the regression model and standard errors
  • Can be identified in scatterplot if 1 predictor variable
  • Measures: leverage, standardized residuals, & Cook’s Distance

Leverage

  • Leverage: measure of the distance between an observation’s values of the predictor variables and the average values of the predictor variables for the entire data set
  • high leverage if combination of values is very far from typical combination values for predictor variables in data (potential influential points) \(\text{sum of the leverages for all points: } p+1 \\ \text{typical leverage: } \frac{p+1}{n}\) \(\text{If } h_i > \frac{2(p+1)}{n} \text{, then high leverage}\)
  • Is there a data entry error? Is this within the scope of indiv. you want to make predictions? Is this observation impacting the estimates of the model coefficients, espeically for interactions?
  • Check other measures! Leverage vs. obs

Standardized & Studentized Residuals

  • Best way to identify outliers: look for large residuals, want a common scale \(\textbf{Std. res_i } = \frac{}{\text{regression standard error ->}}\frac{y_i-\hat{y_i}}{{\hat\sigma_E}\sqrt{1-h_i}} \\ \textbf{stud. resi_i } = \frac{y_i-\hat{y_i}}{{\hat\sigma_i}\sqrt{1-h_i}} \\ \text{stud.resi_i has regression standard error from fitting the model with the ith point removed}\)
  • Observations with high leverage tend to have low values of standardized residuals b/c they pull the regression line towards them.
  • This is avoided using studentized residuals because it is without the influential point
  • augment function in Col. .std.resid (standardized residuals)
  • calculate studentized residuals using .sigma, .resid, & .hat
  • Moderate Outlier: std.res. beyond +/- 2
  • Serious Outlier: std.res. beyond +/- 3
  • Make residual plots with std.res. to make it easier to identify outliers & check constant variance condition

Cook’s Distance

  • Includes both std. resid (how close it lies to general trend of data) & leverage (h_i) to measure an observations overall impact on the model \(D_i = \frac{(\text{std.resi}_i)^2}{p+1}*\frac{h_i}{1-h_i}\) *Large D_i has strong influence on the predicted values
  • Moderately Influential: D_i > 0.5
  • Very Influential: D_i > 1
  • If outlier found, compare coefficients of data both with and without that point

Using These Measures

  • Std. res, leverage, & Cook’s distance should all be examined together.
  • Examine plots to find outliers, high leverage
  • Drop if makes sense, avoid extrapolation *NOT ok to drop based on response variable!

Multicollinearity

Problem: 2 variables have a perfect linear association with each other, could not find unique estimates for model coefficients * Why is this a problem? Standard errors for our regression coefficients inflate, lose precision in our estimates of the regression coefficients, impedes our ability to use th emodel fo rinference or prediction

Defining Multicollinearity

  • High correlations (r > 0.9) among 2+ predictor variables, especially with small sample size
  • One or more predictor variables is an almost perfect linear combination of the others
  • Include a quadratic in the model mean-centering the variable first
  • Including interactions between 2+ continuous variable In EDA: look at correlation matrix and look for plots with linear relationship

Using VIF

  • Variance Inflation Factor: measure of multicollinearity in regression model
  • VIF > 10 indicates multicollinearity

Checking Independence

  • We can often check independence based on the context of the data and how the observations were collected
  • If the data were collected in a particular order, examine a scatterplot of the standardized residuals vs. order in which the data were collected
  • If not satisfied, fit a new model that includes x as a predictor to account for the systematic differences by region
  • Next, check model diagnostics to detect influential points or multicollinearity

Leverage Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)

tips <- read_csv("data/tip-data.csv") %>%
  filter(!is.na(Party)) 

Fit model for predictors and prediction

tips_model <- lm(Tip ~ Party + Meal + Age, data = tips)
tidy(tips_model) %>%
  kable(digits = 3)

Augment function for model diagnostics

tips_aug <- augment(tips_model) %>%
   mutate(obs_num = row_number()) #add row number to help with graphing

Leverage

# Calculate Leverage threshold
leverage_threshold <- 2 * (5 + 1) / nrow(tips) #2(p+1)/n
leverage_threshold
# Leverage vs. Observation Number
ggplot(data = tips_aug, aes(x = obs_num, y = .hat)) + 
  geom_point(alpha = 0.7) + 
  geom_hline(yintercept = leverage_threshold, color = "red") +
  labs(x = "Observation Number", y = "Leverage") +
  geom_text(aes(label=ifelse(.hat > leverage_threshold,
                             as.character(obs_num), "")), nudge_x = 4)
# Filter for number of people above threshold
tips_aug %>%
  filter(.hat > leverage_threshold) %>%
  select(Party, Meal, Age)

These have higher number of people in party than rest of the data.

Outliers Code

Identifying Outliers

# scatterplot of std. residuals vs. predicted
ggplot(data = tips_aug, aes(x = .fitted, y = .std.resid)) + 
  geom_point(alpha = 0.7) + 
  geom_hline(yintercept = c(-2,2), color = "blue", lty = 2) +
  geom_hline(yintercept = c(-3,3), color = "red", lty = 2) +
  ylim(-4,4) +
  geom_text(aes(label=ifelse(abs(.std.resid) > 2,
                             as.character(obs_num), "")), nudge_x = 1)
## scatterplot of Cook's D vs. observation number
tips_aug %>%
  filter(abs(.std.resid) > 3)

Looking at what tips look at according to party size and bill size to show that the typical tips size is according to bill size to see if these tips are abnormally large.

Cook’s Distance Code

ggplot(data = tips_aug, aes(x = obs_num, y = .hat)) + 
  geom_point(alpha = 0.7) + 
  geom_hline(yintercept = c(0.5, 1), color = "red") +
  labs(x = "Observation Number", y = "Cook's D") +
  geom_text(aes(label=ifelse(.hat > 0.5,
                             as.character(obs_num), "")), nudge_x = 1)

Multicollinearity Code

install.packages("rms")
# load rms package and calculate VIF
library(rms)
vif(tips_model) %>% 
  tidy()

Logistic Regression: Odds & Probabilities

Notes

Models for Categorical Response Variables

Logistic Regression
2 outcomes
1: yes, 0: no
♥ ♥ ♥ Multinomial Logistic Regression
3+ outcomes
1: Democrat, 2: Republican, 3: Independent

Response: the model produces predictions between 0 & 1: probabilities
Logistic Regression Model only produces predictions between 0 & 1

Different Types of Models

Linear Regression
Quantitative
\[Y = B_0 + B_1X\]
♥ ♥ ♥ Linear Reg. transform Y
Quantitative
\[log(Y) = B_0 + B_1X\]
♥ ♥ ♥ Logistic Reg.
Binary
\[logodds = log(\frac{\pi}{1-\pi}) = B_0 + B_1X\]

Binary Response Variable

\[ Y = 1: yes, 0: no \\ \pi: \text{probability that Y = 1, ie., P(Y=1)} \\ \frac{\pi}{1-\pi}: \text{odds that Y = 1} \\ log(\frac{\pi}{1-\pi}): \text{log odds} \]

Odds

If p = 0.7, not ___ is 1-p = 0.3
Odds are 7 to 3, 7:3, 0.7/0.3

From odds to probabilities

\[\omega = \frac{\pi}{1-\pi} \text{ & } \pi = \frac{\omega}{1-\omega} \]

Logistic Model: from odds to probabilities

\[\textbf{1. Logistic model: } \text{log odds = } log(\frac{\pi}{1-\pi}) = B_0 + B_1X \\ \textbf{2. Odds = } exp(log(\frac{\pi}{1-\pi})) = \frac{\pi}{1-\pi} \\ \textbf{3. Probability = } \pi = \frac{exp(B_0+B_1X)}{1+exp(B_0+B_1X)}\]

Logistic Regression Model

\[ \textbf{Logit form: } log(\frac{\pi}{1-\pi}) = B_0 + B_1X \\ \textbf{Probability Form: } \pi = \frac{exp(B_0+B_1X)}{1+exp(B_0+B_1X)} \\ \textbf{Predicted odds: } \hat{\omega}=\frac{\hat{\pi}}{1-\hat{\pi}} \\ \textbf{Predicted Probability: } \hat{\pi}=\frac{exp(logodds)}{1+exp(logodds)} \]

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)

heart <- read_csv("data/framingham.csv") %>%
  select(totChol, TenYearCHD) %>%
  filter(!is.na(totChol)) %>% #remove observations with missing cholesterol
  mutate(high_risk = as.factor(TenYearCHD))

Calculating Probability & Odds

# Probability
heart %>%
  count(high_risk) %>%
  mutate(prob = n/sum(n))

P(not high risk) = 0.848

# Odds
0.848/0.152

The odds that a person is not high risk of getting heart disease are about 5.58.

Logistic Regression Model

#glm function for logistic regression model
heart_model <- glm(high_risk ~ totChol, data = heart, family = "binomial")

tidy(heart_model)

log-odds(high-risk) = -2.894 + 0.005 * totChol

Predictions

Based on the model, if a randomly selected person has a total cholesterol of 225:

# log-odds they have a high risk of heart disease
x0 <- tibble(totChol = 225)
predict(heart_model, x0)
# odds they have a high risk of heart disease
odds <- exp(-1.796071)
odds
# probability they have a high risk of heart disease
prob <- odds/(1 + odds)
prob

Randomly selected

Based on the model, if a randomly selected person has a total cholesterol of 226

# log-odds they have a high risk of heart disease
x1 <- tibble(totChol = 226)
log_odds_x1 <- predict(heart_model, x1)
log_odds_x1
# odds they have a high risk of heart disease
odds_x1 <- exp(-1.79119)
odds_x1
# probability they have a high risk of heart disease
prob_x1 <- odds_x1 / (1 + odds_x1)
prob_x1

Comparing Log odds & odds

  1. Based on your answers in the previous section, how do the log-odds change when going from a person with total cholesterol of 225 mg/dL to a person with total cholesterol of 226 mg/dL?

log-odds increase by 0.00488 (coefficient for total cholestral from logistic regression model)

  1. How would you interpret the coefficient of totlChol in terms of the log-odds?

For every one mg/dL increase in totChol the expected log-odds of being high risk are predicted to increase by 0.00488

  1. What would be the interpretation of totChol in terms of the odds?

For every one mg/dL increase in totChol the expected odds are multiplied by 1.005.

Logistic Regression: Odds Ratios

Notes

Compare the odds for 2 groups

Review: odds = probability of success / probability of failure
♥ OR ♥
# of successes / # of failures

Odds Ratio (OR)

\[OR = \frac{odds_1}{odds_2} = \frac{\omega_1}{\omega_2}\] "The odds of ___ are ___ times higher for those with ___ than those with ___"

Categorical Predictor

Log-odds, interpret to the baseline group
"The odds of ___ are expected to be \(exp(\hat{\beta_j})\) times the odds of ___"

\[OR = exp(\hat{\beta_j})\]

Quantitative Predictor

“For each additional ____, the odds of ___ are expected to multiply by a factor of \(exp(\hat{\beta_j})\)

Multiple Predictors

"The odds of ___ for __ are expected to be \(exp(\hat{\beta_j})\) times the odds for ___, holding all else constant"

AE

!(Users/courtneylee/Documents/STAT210/Quiz 3/1.png)

Logistic Regression: Inference

Notes

Hypothesis Test for B_j

\[H_0:B_j=0, H_A:B_j \neq 0\]

Test Statistic

For proortions or probabilities, use z test statistic \[z=\frac{\hat{\beta_j}-0}{SE_\hat{\beta_j}} \\ \text{p-value: } p(\lvert Z \rvert > \lvert z \rvert) \text{(z in standard normal distribution)}\]

Confidence Interval for B_j

\[B_j \pm z^*SE_\hat{\beta_j}\] Log odds!!! “For every 1 unit increase in x the LOG ODDS…”
Interpret for odds!!
\[CI: exp(B_j \pm z^*SE_\hat{\beta_j})\]
“We are C% Confident that for every one unit change in \(x_j\), the odds multiply by a factor of \(exp(B_j - z^*SE_\hat{\beta_j})\) to \(exp(B_j + z^*SE_\hat{\beta_j})\), holding all else constant.”

Comparing Models

Log liklihood (log L): Measure of how well the model fits the data

  • Higher values are better
  • Deviance = -2logL, follows \(\chi^2\) distribution with n-p-1 degrees of freedom (p is number of predictors)

Comparing Nested Models

\[H_0: \beta_{q+1}=...=\beta_p=0 \\ H-A: \text{at least 1 } \beta_j \text{ is not 0}\] Use Drop in Deviance Test!!
\[G=(-2logL_{reduced})-(-2logL_{full}) \\ \text{p-value: } P(\chi^2>G)\] * use chi squared distribution with dof = # parameters in full - # parameters in reduced

Should we add “\(x_i\)” to this model?

If P value from chisq test is very small,
Conclusion: “The p-value is very small, so we reject the null hypothesis. The data provide sufficient evidence that the coefficient of ___ is not equal to 0. Therefore, we should add it to the model.”
* We can also use anova function to conduct test

Model Selection

Use AIC or BIC, choose model with smaller AIC or BIC

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)

leukemia <- read_csv("data/leukemia.csv")

Fit a model using Age to predict whether or not a patient responded to the treatment.

leukemia <- leukemia %>%
  mutate(resp = factor(Resp))

model1 <- glm(resp ~ Age, data = leukemia, family = "binomial")

tidy(model1) %>%
  kable(digits = 3)

For every 1 year increase in age, the odds a patient responded to the treatment are expected to. multiply by a factor of 0.954.
We have a z test statistic. The distribution of the z test statistic is N(0,1), the standard normal distribution.
Our p-value is 0.017 < 0.05, so we reject H0. The data provides sufficient evidence that the coefficient for Age is not equal to 0. Age is a statistically significant predictor of the odds a patient responds to the treatment

All pre-treatment variables

full_model <- glm(resp ~ Age + Smear + Infil + Index + Blasts + Temp, data = leukemia, family = "binomial")

tidy(full_model) %>%
  kable(digits = 3)

Age, index, and temp are statistically significant because they have small p-values.
Think about Age: “Statistically significant” that Age is a significant predictor of the odds a patient responds to the treatment, even after accounting for the all the other pre-treatment variables.

Drop in Deviance Test

reduced_model <- glm(resp ~ Age + Index + Temp, data = leukemia, family = "binomial")

tidy(full_model) %>%
  kable(digits = 3)
anova(reduced_model, full_model, test = "Chisq") %>%
  tidy()

For my model, I would choose the one that only has the significant predictors, because the p-value of the drop-in-deviance test is large.

Lab & HW

Lab 6

Load packages

library(tidyverse)
library(broom)
library(knitr)
gss <- read_csv("https://sta210-fa20.netlify.app/data/gss2016.csv",
  na = c("", "Don't know", "No answer", 
         "Not applicable"), 
         guess_max = 2867) %>%
  select(natmass, age, sex, sei10, region, polviews) %>%
  drop_na()

Exercise 1

The goal of the analysis is to understand the factors that are associated with a person being satisfied with the current spending on mass transportation. Create a new variable that is equal to “1” if a person said spending on mass transportation is about right and “0” otherwise.

# creates new variable
gss <- gss %>%
  mutate(satisfied = if_else(natmass == "About right", "1", "0"))

Exercise 2

Recode polviews so it is a factor variable type with levels that are in an order that is consistent with question on the survey. Note how the categories are spelled in the data.

Make a plot of the distribution of polviews. Which political view occurs most frequently in this data set?

# re-levels polviews
gss <- gss %>%
  mutate(polviews = factor(polviews,
                           levels = c("Extremely liberal", "Liberal", "Slightly liberal",
                                      "Moderate", "Slghtly conservative", "Conservative",
                                      "Extrmly conservative")))

# plots polview distribution
ggplot(data = gss, aes(x = polviews)) +
  geom_bar() +
  labs(x = "Political View",
       y = "Count",
       title = "Distribution of Political Views")

Moderate views occur most frequently in this dataset.

Exercise 3

Make a plot displaying the relationship between satisfaction with mass transportation spending and political views. Use the plot to describe the relationship between a person’s political views and whether they are satisfied with spending on mass transportation.

# plots distribution of polviews vs satisfied
ggplot(data = gss, aes(x = polviews, fill = satisfied)) +
  geom_bar(position = "fill") +
  labs(x = "Political View",
       y = "Satisfaction with Transportation Spending",
       title = "Satisfaction with Spending vs. Political Views")

Exercise 4

We’d like to use age as a quantitative variable in your model; however, it is currently a character data type because some observations are coded as “89 or older”. Recode age so that is a numeric variable. Note: Before making the variable numeric, you will need to replace the values “89 or older” with a single value.

# makes age numeric
gss <- gss %>%
  mutate(age = if_else(age == "89 or older", '89', age)) %>%
  mutate(age = as.numeric(age))

Exercise 5

Briefly explain why we should use a logistic regression model to predict the odds a randomly selected person is satisfied with spending on mass transportation.

We should use a logistic regression model because our response variable is binary and we need a logistic regression model to only produce predictions between 0 and 1 as we want to predict probability.

Exercise 6

Let’s start by fitting a model using the demographic factors - age, sex, sei10, and region - to predict the odds a person is satisfied with spending on mass transportation. Make any necessary adjustments to the variables so the intercept will have a meaningful interpretation.

# mean-centers variables
gss <- gss %>%
  mutate(satisfied = as.numeric(satisfied)) %>%
  mutate(age = age - mean(age)) %>%
  mutate(sei10 = sei10 - mean(sei10))

# creates model
gss_model <- glm(satisfied ~ age + sex + sei10 + region, data = gss, family = "binomial")
tidy(gss_model) %>%
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 0.346 0.103 3.348 0.001
age -0.006 0.002 -2.725 0.006
sexMale -0.261 0.080 -3.247 0.001
sei10 -0.006 0.002 -3.561 0.000
regionE. sou. central -0.168 0.180 -0.936 0.349
regionMiddle atlantic 0.061 0.154 0.395 0.693
regionMountain -0.137 0.167 -0.824 0.410
regionNew england -0.652 0.188 -3.465 0.001
regionPacific -0.384 0.142 -2.708 0.007
regionSouth atlantic -0.048 0.131 -0.367 0.713
regionW. nor. central -0.055 0.181 -0.304 0.761
regionW. sou. central 0.181 0.159 1.142 0.253

Exercise 7

Interpret the intercept in the context of the data. Include any relevant values in your response.

The odds that a person said that the spending on mass transportation is about right is 1.413 (exp(0.346)) if the person has the mean age, is female, has the mean socioeconomic status, and is interviewed in the East North Central region.

Exercise 8

Consider the relationship between age and one’s opinion about spending on mass transportation. Interpret the coefficient of age in terms of the odds of being satisfied with spending on mass transportation.

For every one year increase in age, odds of being satisfied on spending on mass transportation are expected to multiply by a factor of exp(-0.006) = 0.994, holding all else constant.

Exercise 9

Now let’s see whether a person’s political views has a significant impact on their odds of being satisfied with spending on mass transportation, after accounting for the demographic factors.

Conduct a drop-in-deviance test to determine if polviews is a significant predictor of attitude towards spending on mass transportation. State the null and alternative hypotheses in words, display all relevant code and output, and state your conclusion in the context of the problem.

# creates model
gss_pol <- glm(satisfied ~ age + sex + sei10 +
                 region + polviews, data = gss, family = "binomial")
tidy(gss_pol, conf.int = TRUE) %>%
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.423 0.208 -2.035 0.042 -0.837 -0.020
age -0.008 0.002 -3.349 0.001 -0.012 -0.003
sexMale -0.277 0.082 -3.392 0.001 -0.437 -0.117
sei10 -0.005 0.002 -2.739 0.006 -0.008 -0.001
regionE. sou. central -0.204 0.182 -1.120 0.263 -0.560 0.153
regionMiddle atlantic 0.099 0.156 0.633 0.527 -0.206 0.405
regionMountain -0.097 0.169 -0.575 0.565 -0.427 0.234
regionNew england -0.562 0.191 -2.943 0.003 -0.940 -0.190
regionPacific -0.357 0.144 -2.483 0.013 -0.639 -0.076
regionSouth atlantic -0.050 0.133 -0.379 0.705 -0.310 0.210
regionW. nor. central -0.062 0.183 -0.339 0.735 -0.421 0.298
regionW. sou. central 0.147 0.161 0.918 0.359 -0.167 0.464
polviewsLiberal 0.248 0.217 1.144 0.253 -0.174 0.678
polviewsSlightly liberal 0.606 0.220 2.754 0.006 0.178 1.042
polviewsModerate 0.926 0.196 4.716 0.000 0.546 1.317
polviewsSlghtly conservative 0.845 0.214 3.949 0.000 0.430 1.270
polviewsConservative 0.967 0.212 4.552 0.000 0.555 1.390
polviewsExtrmly conservative 1.314 0.281 4.675 0.000 0.770 1.873
# drop-in-deviance test
anovamodel <- anova(gss_model, gss_pol, test = "Chisq")
tidy(anovamodel) %>%
  kable(digits = 3)
Resid..Df Resid..Dev df Deviance p.value
2578 3518.199 NA NA NA
2572 3460.786 6 57.413 0

Null Hypothesis: The coefficients associated with political views are all zero.

Alternative Hypothesis: At least one of the coefficients associated with political views is not zero.

Conclusion: The p-value is near zero, so there is sufficient evidence to reject the null hypothesis. The data provides sufficient evidence that at least one of the coefficients associated with political views is not zero, and political views is a significant predictor of the attitude towards spending on mass transportation.

Exercise 10

Use the model to describe the relationship between one’s political views and their attitude towards spending on mass transportation. Be sure your answer includes the interpretation of the model coefficients and associated hypothesis tests or confidence intervals used to support your response.

* The odds of being satisfied on spending on mass transportation for liberal views are expected to be exp(0.248) = 1.281 times the odds for extremely liberal views, holding all else constant. The p-value is greater than 0.05, which means that there is not sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is liberal.

  • The odds of being satisfied on spending on mass transportation for slightly liberal views are expected to be exp(0.606) = 1.833 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is slightly liberal.

  • The odds of being satisfied on spending on mass transportation for moderate views are expected to be exp(0.926) = 2.524 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is moderate.

  • The odds of being satisfied on spending on mass transportation for slightly conservative views are expected to be exp(0.845) = 2.328 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is slightly conservative.

  • The odds of being satisfied on spending on mass transportation for conservative views are expected to be exp(0.967) = 2.630 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is conservative.

  • The odds of being satisfied on spending on mass transportation for extremely conservative views are expected to be exp(1.314) = 3.721 times the odds for extremely liberal views, holding all else constant. The p-value is less than 0.05, which means that there is sufficient evidence that a person’s attitude towards mass transportation spending differs if the person is extremely liberal compared to if the person is extremely conservative.

HW 5

Load packages & data

library(tidyverse)
library(broom)
library(knitr)
medGPA <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 3/MedGPA.csv")

Question 1

birth <- matrix(c(9584, 4792, 14376, 2.4, 1.1, 2), ncol=3, byrow=TRUE)
colnames(birth) <- c("With Defects","Without Defects", "Total")
rownames(birth) <- c("# babies","% mothers used IVF")
birth <- as.table(birth)
birth
##                    With Defects Without Defects   Total
## # babies                 9584.0          4792.0 14376.0
## % mothers used IVF          2.4             1.1     2.0
IVF <- matrix(c(230, 9354, 9584, 53, 4739, 4792, 283, 14093, 14376), ncol = 3, byrow=TRUE)
colnames(IVF) <- c("Used IVF", "Did not use IVF", "Total")
rownames(IVF) <- c("# of babies with defects", "# of babies without defects", "Total")
IVF <- as.table(IVF)
IVF
##                             Used IVF Did not use IVF Total
## # of babies with defects         230            9354  9584
## # of babies without defects       53            4739  4792
## Total                            283           14093 14376

Question 2

Odds of a birth defect = Total # babies with defects/Total # babies without defects

# Odds of a birth defect

9584/4792
## [1] 2

The odds that a baby in this study was born with a birth defect is 2.

Question 3

# Odds of having baby with defect
.03/.97
## [1] 0.03092784

The odds of having a baby with a birth defect is 0.031 in reality

# Ratio of odds of birth defect in the study and odds of birth defect in reality
2/0.03092784
## [1] 64.66666

The odds of a baby having a birth defect in the study is 64.7 times higher than the odds of a baby having a birth defect in reality.

Find the odd ratio of the odds of a baby having a birth defect from mothers who used IVF to the odds of a baby having a birth defect from mothers who did not use an IVF.

Odds of a birth defect from mothers who use IVF = 230/53 = 4.34

Odds of a birth defect from mothers who did not use IVF = 9354/4739 = 1.97

Odds ratio = 4.34/1.97 = 2.20

The odds of a baby with a birth defect is 2.20 times higher for a mother using IVF that a mother not using an IVF. However, we can not conclude that this data supports the finding’s conclusion because the study has a disproportionately large number of babies who had defects. As seen in the odds ratio in the beginning of question 3, the odds of having a baby with a birth defect in the study is much greater than the odds in reality. Based on the data from the findings, I would conclude that the IVF DOES carry excessive risks, which does not support the researcher’s conclusion.

Question 4

acceptance <- glm(Acceptance ~ MCAT + GPA, data = medGPA, family = "binomial")

tidy(acceptance)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -22.4       6.45      -3.47 0.000527
## 2 MCAT           0.165     0.103      1.59 0.111   
## 3 GPA            4.68      1.64       2.85 0.00439

log-odds(acceptance) = -22.37 + 0.16 * MCAT + 4.68 * GPA

  • For every 1 point increase in the MCAT score, the odds of acceptance into medical school are expected to multiply by 1.179, holding the GPA constant.
  • For every 1 point increase in the GPA, the odds of acceptance into medical school are expected to multiply by 107.389, holding the MCAT score constant.

Increasing GPA by 1 point increases the odds of being accepted into medical school much more than increasing the MCAT score by one point. However, 1 whole GPA score increase might be much more difficult to achieve than 1 point increase in the MCAT score. To make this interpretation more useful, we can state that for every 0.01 point increase in GPA, the odds of acceptance into medical school are expected to multiply by 1.048.

However, because the GPA p-value is significant and the MCAT p-value is not significant, we can infer that improving GPA will have a bigger effect on the odds of acceptance relative to the MCAT score.

Question 5

acceptance2 <- glm(Acceptance ~ MCAT + GPA + Apps, data = medGPA, family = "binomial")

tidy(acceptance2, conf.int = TRUE) %>%
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -23.689 7.024 -3.373 0.001 -39.590 -11.662
MCAT 0.173 0.105 1.641 0.101 -0.021 0.398
GPA 4.861 1.694 2.869 0.004 1.839 8.576
Apps 0.044 0.076 0.575 0.565 -0.106 0.198

\(H_0: \beta_{apps} = 0 \\ H_A: \beta_{apps} \neq 0\)

The null hypothesis states that the number of applications is not a predictor for odds of acceptance, and the alternative hypothesis states that the number of applications is not a predictor for odds of acceptance The p-value of \(\beta_{apps}\) is 0.565 > 0.05. Therefore we fail to reject the null hypothesis, and there is not sufficient evidence from the data that the number of applications predict acceptance into medical school after accounting for MCAT score and GPA.

Question 6

\(H_0: \beta_{q+1} = ... = \beta_p = 0 \\ H_A: \text{at least one } \beta_{j} \neq 0\)

The null hypothesis states that the coefficients in the full model as well as the interactions between sex and MCAT and sex and GPA are not significant additions to the model. The alternative hypothesis states that at least one of the coefficients in the full model is a significant predictor addition to the model.

model_reduced <- glm(Acceptance ~ MCAT + GPA, data = medGPA, family = "binomial")
model_full <- glm(Acceptance ~ MCAT + GPA + Sex + Sex * MCAT + Sex * GPA,
                  data = medGPA, family = "binomial")
tidy(model_full) %>%
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -24.424 10.037 -2.433 0.015
MCAT 0.024 0.147 0.160 0.873
GPA 6.857 3.185 2.153 0.031
SexM -6.155 16.407 -0.375 0.708
MCAT:SexM 0.326 0.237 1.376 0.169
GPA:SexM -1.948 4.162 -0.468 0.640
anova(model_reduced, model_full, test = "Chisq") %>%
  tidy()
## # A tibble: 2 x 5
##   Resid..Df Resid..Dev    df Deviance p.value
##       <dbl>      <dbl> <dbl>    <dbl>   <dbl>
## 1        52       54.0    NA    NA     NA    
## 2        49       48.6     3     5.39   0.146

The p-value is 0.146 > 0.05. Therefore, we reject the null hypothesis and conclude that there is not enough evidence from the data to conclude that after accounting for GPA and MCAT score, sex and interactions between sex and GPA and MCAT are not significant additions as predictors for odds of medical school acceptance. According to this drop in deviance model, the reduced model which just includes GPA and MCAT is the best model. There is also no evidence that the effect of MCAT score or GPA differs for males and females since these are not significant predictors in the model.