Study Guide

The best way to prepare for this open-note exam is to have an organized, clear, exhaustive study guide that provides answers to all the questions you might be asked on your final. We have covered three main topics:

  1. Multiple Regression (with categorical variables)

  2. Logistic Regression

  3. Data Management

Determine the types of information that you will need to know for each of these and then develop a study guide.

Multiple Regression

y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

Step 1: Create scatter plots to examine the relationships between the dependent variable and each independent variable.
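For example, a minimal sketch assuming a data frame df with a dependent variable y and predictors x1 and x2 (hypothetical names):

pairs(df[, c("y", "x1", "x2")])  # scatterplot matrix of every pair of variables
plot(df$x1, df$y)                # single scatter plot of y against one predictor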

Step 2: Run the regression with lm() and store the result in a name like model1.
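A sketch with the same hypothetical names:

model1 <- lm(y ~ x1 + x2, data = df)  # fit and store the multiple regression
summary(model1)                       # coefficients, p-values, and R-squared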

Step 3: Check the assumptions. In a multiple regression, we are interested in the residuals (see the sketch after this list).

  1. 1st plot: the residuals need to be evenly scattered around zero across all fitted values (linear relationship).

  2. 2nd plot: normal distribution of the residuals (QQ plot).

  3. 3rd plot: homogeneity of variance; the standard deviation of the residuals needs to be the same across all of the predicted values of Y.
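Base R draws all three of these plots from the stored model (a sketch, assuming model1 from Step 2):

plot(model1, which = 1)  # residuals vs. fitted values (linearity)
plot(model1, which = 2)  # normal QQ plot of the residuals
plot(model1, which = 3)  # scale-location plot (homogeneity of variance)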

Step 4: Cook's distance tests for outliers or influential points. max(cooks.distance(model1)) gives the highest value. A Cook's distance greater than 1 is problematic.

Step 5: Residuals need to be independent of each other: Durbin-Watson test, dwt(model1). We want the Durbin-Watson statistic to be approximately 2; values less than 1 or greater than 3 are problematic.

Step 6: Standardized residuals should be normally distributed. plot(rstandard(model1)) shows which data points are really far from what the model predicts. A value greater than 3 could be problematic if you have a small dataset, or if more than 1% of the data is greater than 3: sum(rstandard(model1) > 3)

Step 7: Multicollinearity: independent variables are highly correlated with each other. Check with vif(model1).
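Putting Steps 4 through 7 together, a minimal sketch (assuming model1 from Step 2; dwt() and vif() come from the car package):

library(car)                 # provides dwt() and vif()
max(cooks.distance(model1))  # Step 4: values above 1 are problematic
dwt(model1)                  # Step 5: statistic should be approximately 2
plot(rstandard(model1))      # Step 6: look for points far from the rest
sum(rstandard(model1) > 3)   # Step 6: how many standardized residuals exceed 3
vif(model1)                  # Step 7: below 2.5 is great; above 10 is a real problem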

Step 8: Interpret Findings

The results show a negative association between independent and the dependent. This implies that, on average, counties with higher independent tend to have lower dependent, while controlling for other factors in the model.

There is a negative association between 2independent and the dependent. This suggests that, on average, counties with higher 2independent levels tend to have lower dependent, while controlling for other factors in the model.

The results show a positive association between 3independent and the dependent. This indicates that, on average, counties with higher 3independent tend to have higher dependent, while controlling for other factors in the model.

There is a positive association between 4independent and the dependent. This suggests that, on average, counties with a higher proportion of 4independent tend to have higher dependent, while controlling for other factors in the model.

Logistic Regression

Step 1: Visualize the Data
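For a binary outcome, side-by-side boxplots are one quick option (a sketch, assuming a 0/1 outcome y and a hypothetical predictor x1 in df):

boxplot(x1 ~ y, data = df)  # distribution of the predictor in each outcome group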

Step 2: Use the function glm() to run the logistic regression and store the result in a name like “model”.
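A sketch, assuming a binary (0/1) dependent variable y:

model1 <- glm(y ~ x1 + x2, data = df, family = binomial)  # logistic regression
summary(model1)                                           # coefficients and AIC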

Step 3: If you add more than one independent variable, run and store a regression for each candidate model. Then determine the best model by comparing the AIC values and with anova().
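For example, assuming model1 from Step 2 and a hypothetical third predictor x3:

model2 <- glm(y ~ x1 + x2 + x3, data = df, family = binomial)  # larger model
AIC(model1, model2)                    # lower AIC suggests the better model
anova(model1, model2, test = "Chisq")  # chi-squared test comparing the two fits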

Here’s an interpretation of the output:

Pr(>Chi): The p-value for the test is #p#, which is large (greater than 0.05). This suggests that there is not a statistically significant difference between the two models, and Model 2 (with 3independent as an additional predictor variable) does not provide a significantly better fit to the data than Model 1.

Based on the analysis, Model 2 does not provide a significantly better fit to the data compared to Model 1, as the p-value (#p#) is greater than the significance level of 0.05. Therefore, adding the 3independent variable to the model does not lead to a significant improvement in model fit.

In this situation, I would consider using the simpler model (Model 1), as it has fewer predictor variables while still providing a similar level of fit to the data. Additionally, in Model 1, both independent and 2independent are significant predictors of dependent, suggesting that these variables have a statistically significant relationship with dependent.

Step 4: The assumptions to test are (see the sketch after this list):

  1. Cook's distance: tests for outliers or influential points. max(cooks.distance(model1)) gives the highest value. A Cook's distance greater than 1 is problematic.

  2. Standardized residuals: should be normally distributed. plot(rstandard(model1)) shows which data points are really far from what the model predicts. A value greater than 3 could be problematic if you have a small dataset, or if more than 1% of the data is greater than 3: sum(rstandard(model1) > 3)

  3. Multicollinearity: independent variables are highly correlated with each other. Check with vif(model1): less than 2.5 is great; it is a real problem if you are over 10. Note: vif() requires library(car).
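The same calls from the multiple regression section work on the stored glm object (a sketch, assuming model1 from Step 2):

library(car)                 # provides vif()
max(cooks.distance(model1))  # above 1 is problematic
sum(rstandard(model1) > 3)   # count of standardized residuals above 3
vif(model1)                  # below 2.5 is great; above 10 is a real problem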

Step 5: Calculate the odds ratios with exp(coef(model1)). The values for the independent variables are odds ratios. A value greater than 1 means that as the predictor increases, the odds of the outcome also increase. A value less than 1 means that as the predictor increases, the odds of the outcome occurring decrease. A value near 1 means the variable is not a strong predictor of the outcome.
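For example (assuming model1 from Step 2); the second line converts each odds ratio into the percent change in odds used in the communication block below:

exp(coef(model1))              # odds ratios for the intercept and each predictor
(exp(coef(model1)) - 1) * 100  # approximate % change in odds per one-unit increase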

Step 6: Pseudo-R2 (or McFadden's R2), multiplied by 100, indicates the percent of the variation that the model with the specified independent variables explains compared to the null model.
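One way to compute McFadden's R2 by hand is to compare the model's log-likelihood to that of an intercept-only (null) model; a sketch, assuming model1 from Step 2:

null_model <- glm(y ~ 1, data = df, family = binomial)  # null model: intercept only
mcfadden <- 1 - as.numeric(logLik(model1)) / as.numeric(logLik(null_model))
mcfadden * 100  # percent of variation explained relative to the null model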

Step 7: Stat Communication Block

A logistic regression was conducted to determine factors that influence dependent. The results reveal that 2independent and 3independent are significant predictors of dependent (p < 0.05). For each one-unit increase in 2independent (2independent description), the odds of dependent change by about ###% (Odds Ratio = #f#). For each one-unit increase in 3independent (3independent description), the odds of dependent change by about ###% (Odds Ratio = #g#). However, independent (independent description) is not a significant predictor of dependent (p = #?#), and the odds of dependent for one group are about #??# times higher than for the other group, although this difference is not statistically significant. The McFadden's R-squared value for the model is #percent#%, indicating that the model with the independent, 2independent, and 3independent variables explains #percent#% of the variation in dependent compared to the null model.

Data Management

Recoding values and creating new columns:

select(): keep columns
filter(): keep rows
as.numeric(): convert the factor class to numeric
log10(): log transformation; add +1 first so that zeros do not become -Inf

df <- df %>%
  mutate(newcolm = ifelse(as.numeric(oldcolm) >= 10, NA, oldcolm),
         newcolm2 = ifelse(oldcolm2 == 98, NA, oldcolm2),
         newcolm3 = log10(oldcolm3 + 1)) %>%
  filter(as.numeric(ar09) >= 20 & as.numeric(ar09) <= 100) %>%
  select(br15, educ, ar09, religion, log_dowry, kw23b)

Rename a column (note that dplyr's rename() takes the new name first): df <- df %>% rename(new_name = old_name)

For more on data management, see the “Data Management & Multiple Regression in Practice” document.