- Load your preferred dataset into R studio
- Create a linear model “lm()” from the variables, with a continuous
dependent variable as the outcome
- Check the following assumptions:
- Linearity (plot and raintest)
- Independence of errors (durbin-watson)
- Homoscedasticity (plot, bptest)
- Normality of residuals (QQ plot, shapiro test)
- No multicolinarity (VIF, cor)
- does your model meet those assumptions? You don’t have to be
perfectly right, just make a good case.
- If your model violates an assumption, which one?
- What would you do to mitigate this assumption? Show your work.
setwd("C:/Users/qfd738-4251/Desktop/My Class Stuff")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(dplyr)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
#Load data (adjusted skip since headers off)
raw_data <-read_excel("2025-County-Health-Rankings-Texas-Data-v3.xlsx", sheet = "Select Measure Data", skip = 1)
## New names:
## • `Unreliable` -> `Unreliable...4`
## • `95% CI - Low` -> `95% CI - Low...7`
## • `95% CI - High` -> `95% CI - High...8`
## • `National Z-Score` -> `National Z-Score...9`
## • `95% CI - Low` -> `95% CI - Low...39`
## • `95% CI - High` -> `95% CI - High...40`
## • `National Z-Score` -> `National Z-Score...41`
## • `Unreliable` -> `Unreliable...42`
## • `95% CI - Low` -> `95% CI - Low...44`
## • `95% CI - High` -> `95% CI - High...45`
## • `National Z-Score` -> `National Z-Score...46`
## • `95% CI - Low` -> `95% CI - Low...69`
## • `95% CI - High` -> `95% CI - High...70`
## • `National Z-Score` -> `National Z-Score...71`
## • `95% CI - Low` -> `95% CI - Low...73`
## • `95% CI - High` -> `95% CI - High...74`
## • `National Z-Score` -> `National Z-Score...75`
## • `National Z-Score` -> `National Z-Score...77`
## • `National Z-Score` -> `National Z-Score...84`
## • `National Z-Score` -> `National Z-Score...86`
## • `National Z-Score` -> `National Z-Score...90`
## • `National Z-Score` -> `National Z-Score...94`
## • `National Z-Score` -> `National Z-Score...98`
## • `National Z-Score` -> `National Z-Score...100`
## • `National Z-Score` -> `National Z-Score...107`
## • `95% CI - Low` -> `95% CI - Low...115`
## • `95% CI - High` -> `95% CI - High...116`
## • `National Z-Score` -> `National Z-Score...117`
## • `95% CI - Low` -> `95% CI - Low...119`
## • `95% CI - High` -> `95% CI - High...120`
## • `National Z-Score` -> `National Z-Score...130`
## • `95% CI - Low` -> `95% CI - Low...132`
## • `95% CI - High` -> `95% CI - High...133`
## • `National Z-Score` -> `National Z-Score...134`
## • `95% CI - Low` -> `95% CI - Low...152`
## • `95% CI - High` -> `95% CI - High...153`
## • `National Z-Score` -> `National Z-Score...154`
## • `National Z-Score` -> `National Z-Score...156`
## • `National Z-Score` -> `National Z-Score...158`
## • `95% CI - Low` -> `95% CI - Low...161`
## • `95% CI - High` -> `95% CI - High...162`
## • `National Z-Score` -> `National Z-Score...163`
## • `National Z-Score` -> `National Z-Score...165`
## • `Population` -> `Population...167`
## • `95% CI - Low` -> `95% CI - Low...169`
## • `95% CI - High` -> `95% CI - High...170`
## • `National Z-Score` -> `National Z-Score...171`
## • `Population` -> `Population...173`
## • `95% CI - Low` -> `95% CI - Low...175`
## • `95% CI - High` -> `95% CI - High...176`
## • `National Z-Score` -> `National Z-Score...177`
## • `National Z-Score` -> `National Z-Score...181`
## • `National Z-Score` -> `National Z-Score...185`
## • `95% CI - Low` -> `95% CI - Low...187`
## • `95% CI - High` -> `95% CI - High...188`
## • `National Z-Score` -> `National Z-Score...189`
## • `95% CI - Low` -> `95% CI - Low...197`
## • `95% CI - High` -> `95% CI - High...198`
## • `National Z-Score` -> `National Z-Score...199`
## • `National Z-Score` -> `National Z-Score...223`
## • `National Z-Score` -> `National Z-Score...225`
addl_data<-read_excel("2025-County-Health-Rankings-Texas-Data-v3.xlsx", sheet = "Additional Measure Data", skip = 1)
## New names:
## • `95% CI - Low` -> `95% CI - Low...5`
## • `95% CI - High` -> `95% CI - High...6`
## • `# Deaths` -> `# Deaths...28`
## • `95% CI - Low` -> `95% CI - Low...30`
## • `95% CI - High` -> `95% CI - High...31`
## • `# Deaths` -> `# Deaths...53`
## • `95% CI - Low` -> `95% CI - Low...55`
## • `95% CI - High` -> `95% CI - High...56`
## • `# Deaths` -> `# Deaths...78`
## • `95% CI - Low` -> `95% CI - Low...80`
## • `95% CI - High` -> `95% CI - High...81`
## • `95% CI - Low` -> `95% CI - Low...104`
## • `95% CI - High` -> `95% CI - High...105`
## • `95% CI - Low` -> `95% CI - Low...107`
## • `95% CI - High` -> `95% CI - High...108`
## • `95% CI - Low` -> `95% CI - Low...112`
## • `95% CI - High` -> `95% CI - High...113`
## • `95% CI - Low` -> `95% CI - Low...115`
## • `95% CI - High` -> `95% CI - High...116`
## • `# Deaths` -> `# Deaths...117`
## • `95% CI - Low` -> `95% CI - Low...119`
## • `95% CI - High` -> `95% CI - High...120`
## • `95% CI - Low` -> `95% CI - Low...144`
## • `95% CI - High` -> `95% CI - High...145`
## • `95% CI - Low` -> `95% CI - Low...151`
## • `95% CI - High` -> `95% CI - High...152`
## • `95% CI - Low` -> `95% CI - Low...154`
## • `95% CI - High` -> `95% CI - High...155`
## • `95% CI - Low` -> `95% CI - Low...180`
## • `95% CI - High` -> `95% CI - High...181`
## • `95% CI - Low` -> `95% CI - Low...185`
## • `95% CI - High` -> `95% CI - High...186`
## • `95% CI - Low` -> `95% CI - Low...189`
## • `95% CI - High` -> `95% CI - High...190`
## • `95% CI - Low` -> `95% CI - Low...213`
## • `95% CI - High` -> `95% CI - High...214`
## • `95% CI - Low` -> `95% CI - Low...216`
## • `95% CI - High` -> `95% CI - High...217`
## • `95% CI - Low` -> `95% CI - Low...220`
## • `95% CI - High` -> `95% CI - High...221`
## • `95% CI - Low` -> `95% CI - Low...224`
## • `95% CI - High` -> `95% CI - High...225`
## • `95% CI - Low` -> `95% CI - Low...231`
## • `95% CI - High` -> `95% CI - High...232`
## • `95% CI - Low` -> `95% CI - Low...235`
## • `95% CI - High` -> `95% CI - High...236`
## • `Average Grade Performance` -> `Average Grade Performance...246`
## • `Average Grade Performance (AIAN)` -> `Average Grade Performance
## (AIAN)...247`
## • `Average Grade Performance (Asian)` -> `Average Grade Performance
## (Asian)...248`
## • `Average Grade Performance (Black)` -> `Average Grade Performance
## (Black)...249`
## • `Average Grade Performance (Hispanic)` -> `Average Grade Performance
## (Hispanic)...250`
## • `Average Grade Performance (White)` -> `Average Grade Performance
## (White)...251`
## • `Average Grade Performance` -> `Average Grade Performance...252`
## • `Average Grade Performance (AIAN)` -> `Average Grade Performance
## (AIAN)...253`
## • `Average Grade Performance (Asian)` -> `Average Grade Performance
## (Asian)...254`
## • `Average Grade Performance (Black)` -> `Average Grade Performance
## (Black)...255`
## • `Average Grade Performance (Hispanic)` -> `Average Grade Performance
## (Hispanic)...256`
## • `Average Grade Performance (White)` -> `Average Grade Performance
## (White)...257`
## • `Segregation Index` -> `Segregation Index...258`
## • `95% CI - Low` -> `95% CI - Low...265`
## • `95% CI - High` -> `95% CI - High...266`
## • `95% CI - Low` -> `95% CI - Low...268`
## • `95% CI - High` -> `95% CI - High...269`
## • `Segregation Index` -> `Segregation Index...287`
## • `95% CI - Low` -> `95% CI - Low...289`
## • `95% CI - High` -> `95% CI - High...290`
## • `95% CI - Low` -> `95% CI - Low...314`
## • `95% CI - High` -> `95% CI - High...315`
## • `95% CI - Low` -> `95% CI - Low...339`
## • `95% CI - High` -> `95% CI - High...340`
## • `95% CI - Low` -> `95% CI - Low...363`
## • `95% CI - High` -> `95% CI - High...364`
## • `95% CI - Low` -> `95% CI - Low...366`
## • `95% CI - High` -> `95% CI - High...367`
## • `95% CI - Low` -> `95% CI - Low...386`
## • `95% CI - High` -> `95% CI - High...387`
## • `95% CI - Low` -> `95% CI - Low...391`
## • `95% CI - High` -> `95% CI - High...392`
select_keep <- raw_data %>% dplyr::select(FIPS, State, County, '% Vaccinated', 'Preventable Hospitalization Rate', '% with Annual Mammogram', 'Primary Care Physicians Rate', '% With Access to Exercise Opportunities')
addl_keep<- addl_data %>% dplyr::select(FIPS, State, County, 'Life Expectancy', '% Rural')
Data Combined, Fit the Model
combined_df <- addl_keep %>% left_join(select_keep, by=c ("FIPS", "State", "County")) %>% drop_na()
model <- lm(`Life Expectancy` ~ `% Vaccinated` + `Preventable Hospitalization Rate` + `% with Annual Mammogram` + `Primary Care Physicians Rate` + `% With Access to Exercise Opportunities`, data = combined_df)
summary(model)
##
## Call:
## lm(formula = `Life Expectancy` ~ `% Vaccinated` + `Preventable Hospitalization Rate` +
## `% with Annual Mammogram` + `Primary Care Physicians Rate` +
## `% With Access to Exercise Opportunities`, data = combined_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4933 -1.5876 0.0086 1.4292 8.3931
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 69.4444164 1.0967854 63.316
## `% Vaccinated` 0.0675634 0.0237883 2.840
## `Preventable Hospitalization Rate` -0.0006417 0.0001786 -3.593
## `% with Annual Mammogram` 0.0670691 0.0284239 2.360
## `Primary Care Physicians Rate` 0.0072728 0.0057159 1.272
## `% With Access to Exercise Opportunities` 0.0269156 0.0076966 3.497
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## `% Vaccinated` 0.004970 **
## `Preventable Hospitalization Rate` 0.000410 ***
## `% with Annual Mammogram` 0.019248 *
## `Primary Care Physicians Rate` 0.204699
## `% With Access to Exercise Opportunities` 0.000578 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.264 on 202 degrees of freedom
## Multiple R-squared: 0.2999, Adjusted R-squared: 0.2825
## F-statistic: 17.3 on 5 and 202 DF, p-value: 3.05e-14
#This combines both sheets in the data and gives a continuous dependent variable and predictors tied to Medicare, access, and preventive care
Question 3a - Check Assumptions - Linearity (plot and raintest)
model <- lm(`Life Expectancy` ~ `% Vaccinated` + `Preventable Hospitalization Rate` + `% with Annual Mammogram` + `Primary Care Physicians Rate` + `% With Access to Exercise Opportunities`, data = combined_df)
summary(model)
##
## Call:
## lm(formula = `Life Expectancy` ~ `% Vaccinated` + `Preventable Hospitalization Rate` +
## `% with Annual Mammogram` + `Primary Care Physicians Rate` +
## `% With Access to Exercise Opportunities`, data = combined_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4933 -1.5876 0.0086 1.4292 8.3931
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 69.4444164 1.0967854 63.316
## `% Vaccinated` 0.0675634 0.0237883 2.840
## `Preventable Hospitalization Rate` -0.0006417 0.0001786 -3.593
## `% with Annual Mammogram` 0.0670691 0.0284239 2.360
## `Primary Care Physicians Rate` 0.0072728 0.0057159 1.272
## `% With Access to Exercise Opportunities` 0.0269156 0.0076966 3.497
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## `% Vaccinated` 0.004970 **
## `Preventable Hospitalization Rate` 0.000410 ***
## `% with Annual Mammogram` 0.019248 *
## `Primary Care Physicians Rate` 0.204699
## `% With Access to Exercise Opportunities` 0.000578 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.264 on 202 degrees of freedom
## Multiple R-squared: 0.2999, Adjusted R-squared: 0.2825
## F-statistic: 17.3 on 5 and 202 DF, p-value: 3.05e-14
plot(model,which=1)

raintest(model)
##
## Rainbow test
##
## data: model
## Rain = 1.0147, df1 = 104, df2 = 98, p-value = 0.4716
# Assumption check for linearity is mostly met. The residuals vs. fitted plot looks fairly random with only a slight curve, and the rainbow test is not significant so there is not strong evidence of major nonlinearity.
Question 3b - Check Assumptions - Independence of errors
(durbin-watson)
library(car)
durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.06900894 1.842499 0.25
## Alternative hypothesis: rho != 0
# The Assumption check for Independence of errors is met. The Durbin-Watson statistic is 1.847 with p = 0.272, which is close to 2 and not significant, so there is not strong evidence of autocorrelation.
Question 3c - Check Assumptions - Homoscedasticity (plot,
bptest)
plot(model,which=1)

bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 14.691, df = 5, p-value = 0.01177
# The Assumption check for Homoscedasticity shows a violation. The BP test is significant with p= 0.01201, which suggests non-constant variance in residuals
Question 3d - Check Assumptions - Normality of residuals (QQ plot,
shapiro test)
plot(model,which=2)

qqnorm(resid(model))
qqline(resid(model), col= "red")

shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.98732, p-value = 0.0607
# The assumption check for normality of residuals is mostly met, with a small caveat. The Q-Q plot follows the line fairly well except for the upper tail, and Shapiro-Wilk is p=0.06039, which is slightly above 0.05
Question 3e - Check Assumptions - No multicolinarity (VIF, cor)
vif(model)
## `% Vaccinated`
## 1.806629
## `Preventable Hospitalization Rate`
## 1.071509
## `% with Annual Mammogram`
## 1.839736
## `Primary Care Physicians Rate`
## 1.141538
## `% With Access to Exercise Opportunities`
## 1.150237
cor(combined_df[, c("% Vaccinated", "Preventable Hospitalization Rate", "% with Annual Mammogram", "Primary Care Physicians Rate", "% With Access to Exercise Opportunities")], use = "complete.obs")
## % Vaccinated
## % Vaccinated 1.00000000
## Preventable Hospitalization Rate 0.03332079
## % with Annual Mammogram 0.64260568
## Primary Care Physicians Rate 0.12569137
## % With Access to Exercise Opportunities 0.13849826
## Preventable Hospitalization Rate
## % Vaccinated 0.03332079
## Preventable Hospitalization Rate 1.00000000
## % with Annual Mammogram -0.16220260
## Primary Care Physicians Rate -0.03869987
## % With Access to Exercise Opportunities -0.07102028
## % with Annual Mammogram
## % Vaccinated 0.64260568
## Preventable Hospitalization Rate -0.16220260
## % with Annual Mammogram 1.00000000
## Primary Care Physicians Rate 0.14563327
## % With Access to Exercise Opportunities 0.04536222
## Primary Care Physicians Rate
## % Vaccinated 0.12569137
## Preventable Hospitalization Rate -0.03869987
## % with Annual Mammogram 0.14563327
## Primary Care Physicians Rate 1.00000000
## % With Access to Exercise Opportunities 0.32678300
## % With Access to Exercise Opportunities
## % Vaccinated 0.13849826
## Preventable Hospitalization Rate -0.07102028
## % with Annual Mammogram 0.04536222
## Primary Care Physicians Rate 0.32678300
## % With Access to Exercise Opportunities 1.00000000
# The assumption check for multicollinearity is met. The VIF values are all low, around 1.07 to 1.81, and the correlations among predictors are not high enough to suggest a serious collinearity problem.
Questions 4, 5, and 6
# Overall, the model mostly meets the assumptions, but the strongest concern is homoscedasticity because the BP Test is significant. Linearity looks reasonably acceptable from the residuals vs. fitted plot and the rainbow test, and independence looks acceptable from the Durbin-Watson test.
# The clearest violation is homoscedasticity. The residuals appear to fan out a bit at the fitted extremes, adn the BP test confirms that the variance is not fully constant.
# A good fix is to use robust standard errors so coefficient tests are less sensitive to unequal variance. Note: I tried used CrPlot and got an error stating '% Vaccinated" was not in the model.