The best way to prepare for this open-note exam is to build an organized, clear, exhaustive study guide that answers every type of question you might see on the final. We have covered 3 main topics:
Multiple Regression (with categorical variables)
Logistic Regression
Data Management
Determine the types of information you will need to know for each of these topics, then develop your study guide.
Multiple regression models a continuous dependent variable as a linear function of several independent variables:
y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
Step 1: Create scatter plots to examine the relationships between these variables.
pairs(df)
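If the data frame has many columns, pairs() also accepts a formula to limit the matrix to the model variables; a minimal sketch using this guide's placeholder names:
pairs(~ dependent + independent + independent2, data = df) # scatter plots for just these variables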
To run a regression model, we assume that the relationship between the independent and dependent variables is linear.
Step 2: Run regression
model1 <- lm(dependent ~ independent + independent2, data = df)
summary(model1)
The results of the multiple regression model suggest that both independent and independent2 are significant predictors of dependent.
For independent, the estimated coefficient is ###, with a standard error of ###, a t-value of ###, and a p-value of ###. Since the p-value is less than the significance level of 0.05, we can conclude that independent significantly predicts dependent.
For independent2, the estimated coefficient is ###, with a standard error of ###, a t-value of ###, and a p-value of ###. As the p-value is also less than the significance level of 0.05, we can conclude that independent2 significantly predicts dependent.
Step 3: Assumptions. In a multiple regression, we are interested in the residuals. plot(model1) gives 4 diagnostic plots (see the sketch below):
1st plot (Residuals vs. Fitted): the residuals need to be similar across the fitted values (linear relationship).
2nd plot (Normal Q-Q): normal distribution of residuals.
3rd plot (Scale-Location): homogeneity of variance; the standard deviation needs to be the same across all of the predicted values of Y.
4th plot (Residuals vs. Leverage): flags influential points (see Cook's distance in Step 4).
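A minimal sketch for viewing all four diagnostics in one window (base R):
par(mfrow = c(2, 2)) # arrange a 2 x 2 grid
plot(model1) # draws the four diagnostic plots
par(mfrow = c(1, 1)) # reset the plotting layout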
Step 4: Cook's distance: test for outliers or influential points. max(cooks.distance(model1)) gives the highest value. A Cook's distance > 1 is problematic.
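A slightly fuller sketch of the same check, which also identifies which observations (if any) cross the threshold:
cd <- cooks.distance(model1)
max(cd) # largest Cook's distance in the model
which(cd > 1) # row numbers of any problematic, influential points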
Step 5: Residuals need to be independent of each other: Durbin-Watson test, dwt(model1) (from the car package). We want the Durbin-Watson statistic to be approximately 2. Values less than 1 or greater than 3 are problematic.
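A minimal sketch, assuming the car package is installed:
library(car) # dwt() is car's shorthand for durbinWatsonTest()
dwt(model1) # statistic near 2 suggests independent residuals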
Step 6: Standardized residuals: should be normally distributed. plot(rstandard(model1)) shows whether a data point is really far away from what is predicted. Values greater than 3 could be problematic if you have a small dataset, or if more than 1% of the data points are greater than 3. Count them with sum(rstandard(model1) > 3); see the expanded sketch below.
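An expanded version of that check; using absolute values here is my addition, since large negative residuals matter just as much as large positive ones:
rs <- rstandard(model1)
plot(rs) # look for points far from 0
sum(abs(rs) > 3) # count of extreme standardized residuals
mean(abs(rs) > 3) # proportion; more than 1% is a concern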
Step 7: Multicollinearity: independent variables are highly correlated with each other. Check with vif(model1) (car package).
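A minimal sketch, using the thresholds this guide gives in the logistic regression section:
library(car)
vif(model1) # less than 2.5 is great; over 10 is a real problem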
Step 8: Interpret Findings
The results show a negative association between independent and the dependent. This implies that, on average, counties with higher independent tend to have lower dependent, while controlling for other factors in the model.
There is a negative association between independent2 and the dependent. This suggests that, on average, counties with higher independent2 levels tend to have lower dependent, while controlling for other factors in the model.
The results show a positive association between independent3 and the dependent. This indicates that, on average, counties with higher independent3 tend to have higher dependent, while controlling for other factors in the model.
There is a positive association between independent4 and the dependent. This suggests that, on average, counties with a higher proportion of independent4 tend to have higher dependent, while controlling for other factors in the model.
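The second topic, logistic regression, models the log-odds of a binary outcome rather than the outcome itself:
logit(p) = ln(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ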
Step 1: Visualize the Data
plot(df$x, as.numeric(df$y))
df %>% group_by(y) %>% summarize(mean_x = mean(x)) # requires library(dplyr)
Step 2: Use the function glm() to run the logistic regression and store the result in a name like “model”.
model1 <- glm(dependent ~ independent + independent2, family = binomial, data = df)
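As with lm(), the coefficient table interpreted in the next paragraphs comes from summary():
summary(model1) # coefficients, standard errors, z-values, p-values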
independent is a significant predictor of dependent. The p-value associated with independent is ### (indicated by ***), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between independent (independent description) and dependent.
independent2 is not a significant predictor of dependent. The p-value associated with independent2 is ### (indicated by the absence of a *), which is greater than the significance level of 0.05. This suggests that there is not a statistically significant relationship between independent2 (independent2 description) and dependent.
Step 3: If you add more than one independent variable, run and store the regression for each. Then determine the best model by comparing the AIC values and by testing nested models with anova().
model2 <- glm(dependent ~ independent + independent2 + independent3, family = binomial, data = df)
anova(model1, model2, test = "Chisq")
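Since this step also mentions AIC, those values can be pulled directly before interpreting the chi-squared output below (lower AIC is better):
AIC(model1, model2) # compare AIC across the candidate models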
Here’s an interpretation of the output:
Resid. Df: The residual degrees of freedom for Model 1 and Model 2 are #a# and #b#, respectively.
Resid. Dev: The residual deviance for Model 1 (##) and Model 2 (##) measures the goodness-of-fit of each model. Lower values indicate a better fit.
Df: The difference in degrees of freedom between the two models (#a# − #b#) is 1. This is because Model 2 has one more predictor variable (independent3) than Model 1.
Deviance: The difference in residual deviance between the two
models is #d#. This value represents the reduction in
residual deviance when moving from Model 1 to Model 2.
Pr(>Chi): The p-value for the test is #p#, which is large (greater than 0.05). This suggests that there is not a statistically significant difference between the two models, and Model 2 (with independent3 as an additional predictor variable) does not provide a significantly better fit to the data than Model 1.
Based on the analysis, Model 2 does not provide a
significantly better fit to the data compared to Model 1, as the p-value
(#p#) is greater than the significance level of 0.05.
Therefore, adding the independent3 variable to the model
does not lead to a significant improvement in model fit.
In this situation, I would consider using the simpler model (Model 1), as it has fewer predictor variables while still providing a similar level of fit to the data. Additionally, in Model 1, both independent and independent2 are significant predictors of dependent, suggesting that these variables have a statistically significant relationship with dependent.
Step 4: The assumptions to test are:
Cook's distance: test for outliers or influential points. max(cooks.distance(model1)) gives the highest value. A Cook's distance > 1 is problematic.
Standardized residuals: should be normally distributed. plot(rstandard(model1)) shows whether a data point is really far away from what is predicted. Values greater than 3 could be problematic if you have a small dataset, or if more than 1% of the data points are greater than 3: sum(rstandard(model1) > 3).
Multicollinearity: independent variables are highly correlated with each other. Check with vif(model1): less than 2.5 is great! It is a real problem if you are over 10. Note: requires library(car).
Step 5: Calculate the odds ratios with exp(coef(model1)); see the sketch after this paragraph. The values for the independent variables are odds ratios. A value greater than 1 means that as the predictor increases, the odds of the outcome also increase. A value less than 1 means that as the predictor increases, the odds of the outcome occurring decrease. A value near 1 means the predictor has little effect on the odds of the outcome.
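A minimal sketch; the confidence-interval line is my addition beyond the notes above (in older R sessions it may require library(MASS) for the profiling):
exp(coef(model1)) # odds ratios
exp(confint(model1)) # 95% confidence intervals on the odds-ratio scale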
The odds ratio for independent is #e#. This
means that the odds of dependent for “factor in
independent” are about #e# times higher than for
“other factor in independent”, keeping all other variables
constant. However, the independent variable was not
statistically significant in Model 1, so I cannot confidently claim that
there is a significant relationship between independent and
dependent.
The odds ratio for independent2 is #f#. This means that for each one-unit increase in the independent2 score, the odds of dependent decrease by about ((1 − #f#) × 100)%, keeping all other variables constant. Since the p-value for independent2 is significant in Model 1, I conclude that there is a statistically significant relationship between independent2 and dependent.
The odds ratio for independent3 is #g#. This means that for each one-unit increase in the independent3 risk score, the odds of dependent increase by about ((#g# − 1) × 100)%, keeping all other variables constant. As the p-value for independent3 is significant in Model 1, I conclude that there is a statistically significant relationship between independent3 and dependent.
Step 6: Pseudo-R² (or McFadden's R²), multiplied by 100. This indicates the percentage of the variation that the model with the specified independent variables explains compared to the null model.
library(pscl)
pR2(model1)["McFadden"]
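To report it as a percentage, as the step describes (assuming the pscl package is installed):
pR2(model1)["McFadden"] * 100 # McFadden's pseudo-R² as a percent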
Step 7: Stat Communication Block
A logistic regression was conducted to determine factors that influence dependent. The results reveal that independent2 and independent3 are significant predictors of dependent (p < 0.05). For each one-unit increase in independent2 (independent2 description), the odds of dependent change by about ###% (Odds Ratio = #f#). For each one-unit increase in independent3 (independent3 description), the odds of dependent change by about ###% (Odds Ratio = #g#). However, independent (independent description) is not a significant predictor of dependent (p = #?#), and the odds of dependent for one group are about #??# times higher than for the other group, although this difference is not statistically significant. The McFadden's R-squared value for the model is #percent#%, indicating that the model with the independent, independent2, and independent3 variables explains #percent#% of the variation in dependent compared to the null model.
Recode and create new columns:
select() keeps columns; filter() keeps rows
as.numeric() converts a factor to numeric (see the caution below)
log10() log transformation; add 1 first so zeros do not produce -Inf
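One caution worth adding to your guide: as.numeric() on a factor returns the underlying level codes, not the displayed values. If numbers are stored as factor levels, convert through character first. A minimal illustration:
f <- factor(c("10", "20", "30"))
as.numeric(f) # 1 2 3 -- the level codes, usually not what you want
as.numeric(as.character(f)) # 10 20 30 -- the actual values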
library(dplyr)
df <- df %>%
  mutate(newcolm = ifelse(as.numeric(oldcolm) >= 10, NA, oldcolm),
         newcolm2 = ifelse(oldcolm2 == 98, NA, oldcolm2),
         newcolm3 = log10(oldcolm3 + 1)) %>%
  filter(as.numeric(ar09) >= 20 & as.numeric(ar09) <= 100) %>%
  select(br15, educ, ar09, religion, log_dowry, kw23b)
Rename a column (note the order: the new name comes first):
df <- df %>% rename(new_name = old_name)
For more on data management, see the “Data Management & Multiple Regression in Practice” document.