The best way to prepare for this open-note exam is to build an organized, clear, exhaustive study guide that answers every type of question you might see on the final. We have covered 3 main topics:
Multiple Regression (with categorical variables)
Logistic Regression
Data Management
Determine the types of information you will need to know for each of these topics, then develop your study guide.
Multiple regression models a continuous dependent variable as a linear function of several independent variables:
y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
Step 1: Create scatter plots to examine the relationships between these variables.
pairs(df)
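If the data frame has many columns, pairs() also accepts a formula to limit the matrix to the model variables; a minimal sketch using this guide's placeholder names:
pairs(~ dependent + independent + independent2, data = df) # scatter plots for just these variables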
To run a regression model, we assume that the relationship between the independent and dependent variables is linear.
Step 2: Run regression
model1 <- lm(dependent ~ independent + independent2, data = df)
summary(model1)
The results of the multiple regression model suggest that both independent and independent2 are significant predictors of dependent.
For independent, the estimated coefficient is ###, with a standard error of ###, a t-value of ###, and a p-value of ###. Since the p-value is less than the significance level of 0.05, we can conclude that independent significantly predicts dependent.
For independent2, the estimated coefficient is ###, with a standard error of ###, a t-value of ###, and a p-value of ###. As the p-value is also less than the significance level of 0.05, we can conclude that independent2 significantly predicts dependent.
Step 3: Assumptions. In a multiple regression, we are interested in the residuals. plot(model1) gives 4 diagnostic plots (see the sketch below):
1st plot (Residuals vs. Fitted): the residuals need to be similar across the fitted values (linear relationship).
2nd plot (Normal Q-Q): normal distribution of residuals.
3rd plot (Scale-Location): homogeneity of variance; the standard deviation needs to be the same across all of the predicted values of Y.
4th plot (Residuals vs. Leverage): flags influential points (see Cook's distance in Step 4).
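A minimal sketch for viewing all four diagnostics in one window (base R):
par(mfrow = c(2, 2)) # arrange a 2 x 2 grid
plot(model1) # draws the four diagnostic plots
par(mfrow = c(1, 1)) # reset the plotting layout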
Step 4: Cook's distance: test for outliers or influential points. max(cooks.distance(model1)) gives the highest value. A Cook's distance > 1 is problematic.
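A slightly fuller sketch of the same check, which also identifies which observations (if any) cross the threshold:
cd <- cooks.distance(model1)
max(cd) # largest Cook's distance in the model
which(cd > 1) # row numbers of any problematic, influential points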
Step 5: Residuals need to be independent of each other: Durbin-Watson test, dwt(model1) (from the car package). We want the Durbin-Watson statistic to be approximately 2. Values less than 1 or greater than 3 are problematic.
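A minimal sketch, assuming the car package is installed:
library(car) # dwt() is car's shorthand for durbinWatsonTest()
dwt(model1) # statistic near 2 suggests independent residuals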
Step 6: Standardized residuals: should be normally distributed. plot(rstandard(model1)) shows whether a data point is really far away from what is predicted. Values greater than 3 could be problematic if you have a small dataset, or if more than 1% of the data points are greater than 3. Count them with sum(rstandard(model1) > 3); see the expanded sketch below.
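An expanded version of that check; using absolute values here is my addition, since large negative residuals matter just as much as large positive ones:
rs <- rstandard(model1)
plot(rs) # look for points far from 0
sum(abs(rs) > 3) # count of extreme standardized residuals
mean(abs(rs) > 3) # proportion; more than 1% is a concern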
Step 7: Multicollinearity: independent variables are highly correlated with each other. Check with vif(model1) (car package).
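A minimal sketch, using the thresholds this guide gives in the logistic regression section:
library(car)
vif(model1) # less than 2.5 is great; over 10 is a real problem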
Step 8: Interpret Findings
The results show a negative association between independent and the dependent. This implies that, on average, counties with higher independent tend to have lower dependent, while controlling for other factors in the model.
There is a negative association between independent2 and the dependent. This suggests that, on average, counties with higher independent2 levels tend to have lower dependent, while controlling for other factors in the model.
The results show a positive association between independent3 and the dependent. This indicates that, on average, counties with higher independent3 tend to have higher dependent, while controlling for other factors in the model.
There is a positive association between independent4 and the dependent. This suggests that, on average, counties with a higher proportion of independent4 tend to have higher dependent, while controlling for other factors in the model.
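The second topic, logistic regression, models the log-odds of a binary outcome rather than the outcome itself:
logit(p) = ln(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ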
Step 1: Visualize the Data
plot(df$x, as.numeric(df$y))
df %>% group_by(y) %>% summarize(mean_x = mean(x)) # requires library(dplyr)
Step 2: Use the function glm() to run the logistic regression and store the result in a name like “model”.
model1 <- glm(dependent ~ independent + independent2, family = binomial, data = df)
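As with lm(), the coefficient table interpreted in the next paragraphs comes from summary():
summary(model1) # coefficients, standard errors, z-values, p-values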
independent is a significant predictor of dependent. The p-value associated with independent is ### (indicated by ***), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between independent (independent description) and dependent.
independent2 is not a significant predictor of dependent. The p-value associated with independent2 is ### (indicated by the absence of a *), which is greater than the significance level of 0.05. This suggests that there is not a statistically significant relationship between independent2 (independent2 description) and dependent.
Step 3: If you add more than one independent variable, run and store the regression for each. Then determine the best model by comparing the AIC values and by testing nested models with anova().
model2 <- glm(dependent ~ independent + independent2 + independent3, family = binomial, data = df)
anova(model1, model2, test = "Chisq")
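Since this step also mentions AIC, those values can be pulled directly before interpreting the chi-squared output below (lower AIC is better):
AIC(model1, model2) # compare AIC across the candidate models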
Here’s an interpretation of the output:
Resid. Df: The residual degrees of freedom for Model 1 and Model 2 are #a# and #b#, respectively.
Resid. Dev: The residual deviance for Model 1 (##) and Model 2 (##) measures the goodness-of-fit of each model. Lower values indicate a better fit.
Df: The difference in degrees of freedom between the two models (#a# − #b#) is 1. This is because Model 2 has one more predictor variable (independent3) than Model 1.
Deviance: The difference in residual deviance between the two
models is #d#. This value represents the reduction in
residual deviance when moving from Model 1 to Model 2.
Pr(>Chi): The p-value for the test is #p#, which is large (greater than 0.05). This suggests that there is not a statistically significant difference between the two models, and Model 2 (with independent3 as an additional predictor variable) does not provide a significantly better fit to the data than Model 1.
Based on the analysis, Model 2 does not provide a
significantly better fit to the data compared to Model 1, as the p-value
(#p#) is greater than the significance level of 0.05.
Therefore, adding the independent3 variable to the model
does not lead to a significant improvement in model fit.
In this situation, I would consider using the simpler model (Model 1), as it has fewer predictor variables while still providing a similar level of fit to the data. Additionally, in Model 1, both independent and independent2 are significant predictors of dependent, suggesting that these variables have a statistically significant relationship with dependent.
Step 4: The assumptions to test are:
Cook's distance: test for outliers or influential points. max(cooks.distance(model1)) gives the highest value. A Cook's distance > 1 is problematic.
Standardized residuals: should be normally distributed. plot(rstandard(model1)) shows whether a data point is really far away from what is predicted. Values greater than 3 could be problematic if you have a small dataset, or if more than 1% of the data points are greater than 3: sum(rstandard(model1) > 3).
Multicollinearity: independent variables are highly correlated with each other. Check with vif(model1): less than 2.5 is great! It is a real problem if you are over 10. Note: requires library(car).
Step 5: Calculate the odds ratios with exp(coef(model1)); see the sketch after this paragraph. The values for the independent variables are odds ratios. A value greater than 1 means that as the predictor increases, the odds of the outcome also increase. A value less than 1 means that as the predictor increases, the odds of the outcome occurring decrease. A value near 1 means the predictor has little effect on the odds of the outcome.
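A minimal sketch; the confidence-interval line is my addition beyond the notes above (in older R sessions it may require library(MASS) for the profiling):
exp(coef(model1)) # odds ratios
exp(confint(model1)) # 95% confidence intervals on the odds-ratio scale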
The odds ratio for independent is #e#. This
means that the odds of dependent for “factor in
independent” are about #e# times higher than for
“other factor in independent”, keeping all other variables
constant. However, the independent variable was not
statistically significant in Model 1, so I cannot confidently claim that
there is a significant relationship between independent and
dependent.
The odds ratio for independent2 is #f#. This means that for each one-unit increase in the independent2 score, the odds of dependent decrease by about ((1 − #f#) × 100)%, keeping all other variables constant. Since the p-value for independent2 is significant in Model 1, I conclude that there is a statistically significant relationship between independent2 and dependent.
The odds ratio for independent3 is #g#. This means that for each one-unit increase in the independent3 risk score, the odds of dependent increase by about ((#g# − 1) × 100)%, keeping all other variables constant. As the p-value for independent3 is significant in Model 1, I conclude that there is a statistically significant relationship between independent3 and dependent.
Step 6: Pseudo-R² (or McFadden's R²), multiplied by 100. This indicates the percentage of the variation that the model with the specified independent variables explains compared to the null model.
library(pscl)
pR2(model1)["McFadden"]
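To report it as a percentage, as the step describes (assuming the pscl package is installed):
pR2(model1)["McFadden"] * 100 # McFadden's pseudo-R² as a percent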
Step 7: Stat Communication Block
A logistic regression was conducted to determine factors that influence dependent. The results reveal that independent2 and independent3 are significant predictors of dependent (p < 0.05). For each one-unit increase in independent2 (independent2 description), the odds of dependent change by about ###% (Odds Ratio = #f#). For each one-unit increase in independent3 (independent3 description), the odds of dependent change by about ###% (Odds Ratio = #g#). However, independent (independent description) is not a significant predictor of dependent (p = #?#), and the odds of dependent for one group are about #??# times higher than for the other group, although this difference is not statistically significant. The McFadden's R-squared value for the model is #percent#%, indicating that the model with the independent, independent2, and independent3 variables explains #percent#% of the variation in dependent compared to the null model.
Recode and create new columns:
select() keeps columns; filter() keeps rows
as.numeric() converts a factor to numeric (see the caution below)
log10() log transformation; add 1 first so zeros do not produce -Inf
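One caution worth adding to your guide: as.numeric() on a factor returns the underlying level codes, not the displayed values. If numbers are stored as factor levels, convert through character first. A minimal illustration:
f <- factor(c("10", "20", "30"))
as.numeric(f) # 1 2 3 -- the level codes, usually not what you want
as.numeric(as.character(f)) # 10 20 30 -- the actual values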
library(dplyr)
df <- df %>%
  mutate(newcolm = ifelse(as.numeric(oldcolm) >= 10, NA, oldcolm),
         newcolm2 = ifelse(oldcolm2 == 98, NA, oldcolm2),
         newcolm3 = log10(oldcolm3 + 1)) %>%
  filter(as.numeric(ar09) >= 20 & as.numeric(ar09) <= 100) %>%
  select(br15, educ, ar09, religion, log_dowry, kw23b)
Rename a column (note the order: the new name comes first):
df <- df %>% rename(new_name = old_name)
For more on data management, see the “Data Management & Multiple Regression in Practice” document.