To start off my homework, I need to prepare my data. I previously completed these steps in Homeworks 5 and 6.

#Loading packages and setting working directory
library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.5.2
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
setwd("C:/Users/andie/Downloads")
NH_ProviderInfo_Jul2025 <- read_excel("NH_ProviderInfo_Jul2025.xlsx")

#Selecting data I am interested in exploring further
nursing.home<- NH_ProviderInfo_Jul2025 |> select("Total Weighted Health Survey Score", "Ownership Type", "Number of Certified Beds", "Total nursing staff turnover")

#Changing "Ownership Type" from categorical to binary value. I am creating three categories: for profit, government, and nonprofit.
nursing.home2 <- nursing.home |> mutate(Ownership_Type_Num = case_when(
  `Ownership Type` %in% c(
    "For profit - Corporation",
    "For profit - Individual",
    "For profit - Limited Liability company",
    "For profit - Partnership"
  ) ~ 1,
  `Ownership Type` %in% c(
    "Government - City",
    "Government - City/county",
    "Government - County",
    "Government - Federal",
    "Government - Hospital district",
    "Government - State"
  ) ~ 2,
  `Ownership Type` %in% c(
    "Non profit - Church related",
    "Non profit - Corporation",
    "Non profit - Other"
  ) ~ 3,
  TRUE ~ NA_real_
))
#Now converting this into three binary (dummy) variables:
nursing.home3 <- nursing.home2 |>
  mutate(
    FOR_PROFIT = ifelse(Ownership_Type_Num == 1, 1, 0),
    GOVERNMENT = ifelse(Ownership_Type_Num == 2, 1, 0),
    NON_PROFIT = ifelse(Ownership_Type_Num == 3, 1, 0)
  )

#I'll tidy the data a bit by removing the "Ownership Type" and "Ownership_Type_Num" variables: 
nursing.home4 <- nursing.home3 |> select("Total Weighted Health Survey Score", "Number of Certified Beds", "Total nursing staff turnover", "FOR_PROFIT", "GOVERNMENT", "NON_PROFIT")

#As a final step, I am removing NA values from the data. Total Weighted Health Survey Score has 56 missing values, and Total nursing staff turnover has 1105.

nursing.home.clean <- nursing.home4 |> filter(!is.na(`Total Weighted Health Survey Score`), !is.na(`Total nursing staff turnover`))
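
One way I could double-check those missing-value counts is to count the NAs in each column before filtering; just a quick sketch:

#Counting missing values in each column of nursing.home4
colSums(is.na(nursing.home4))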

As in Homework 6, I will use the Total Weighted Health Survey Score as my dependent variable. Number of Certified Beds, Total nursing staff turnover, and ownership type (for profit, government, or nonprofit) will be my independent variables. Let’s create a linear model:

model<-lm(`Total Weighted Health Survey Score`~`Number of Certified Beds`+`Total nursing staff turnover`+FOR_PROFIT+GOVERNMENT+NON_PROFIT,data=nursing.home.clean)

Let’s check whether this model meets the assumptions required for a linear model to be valid.

Checking Assumptions

Linearity of Variables

To start off, we can visually analyze how linear the model is with a simple plot.

plot(model, which=1)

As I noted in Homework 6, the relationship is not perfectly linear. The points follow the reference line somewhat closely, but the red line definitely slopes downward. We can use a “rainbow test” to check whether the variables have a linear relationship.

raintest(model)
## 
##  Rainbow test
## 
## data:  model
## Rain = 1.4343, df1 = 6820, df2 = 6813, p-value < 2.2e-16

The null hypothesis for the rainbow test is that the relationship is linear. The very small p-value means the null hypothesis is rejected: the data are not linear.

Independence of Errors

library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
durbinWatsonTest(model)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1097863      1.780421       0
##  Alternative hypothesis: rho != 0

For the Durbin-Watson test, a p-value greater than .05 means we cannot reject the null hypothesis that the errors are independent. For my data, the p-value is 0, so the errors are not independent.

Homoscedasticity

We can visually inspect the homoscedasticity with a plot.

plot(model, which=3)

The red line is not totally straight; it slopes upwards. It doesn’t appear that the points cluster closely around the line, either. We can also check with math:

bptest(model)
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 281.03, df = 4, p-value < 2.2e-16

In the Breusch-Pagan test, a p-value below .05 means the null hypothesis (that the model is homoscedastic) can be rejected. The p-value here is far below .05, so the model is heteroscedastic.

Normality of Residuals

Let’s make another plot!

plot(model, which=2)

Once again, the observations are not along the dotted line. Let’s use math to check again. Because my dataset has more than 5,000 observations, I will use the Kolmogorov-Smirnov test.

ks.test(model$residuals, "pnorm")
## Warning in ks.test.default(model$residuals, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  model$residuals
## D = 0.62269, p-value < 2.2e-16
## alternative hypothesis: two-sided

The p-value is less than 0.05, so the null hypothesis can be rejected. The Kolmogorov-Smirnov test is sensitive to small departures from normality, especially with a dataset as large as mine, but I accept its conclusion that the residuals are not normal because the Q-Q plot clearly shows this.
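
One caveat: written this way, ks.test compares the residuals to a standard normal (mean 0, sd 1), so some of that tiny p-value simply reflects the residuals’ larger spread. A fairer version of the same check would be a sketch like this, supplying the residuals’ own mean and standard deviation:

#Comparing the residuals to a normal with their own mean and standard deviation
ks.test(model$residuals, "pnorm", mean = mean(model$residuals), sd = sd(model$residuals))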

No Multicollinearity

To test for the final assumption, we can use the vif command to see if any of the variables are strongly correlated with each other.

Note: I ran this as a chunk, but my homework would not ‘knit’ and publish with the chunk. I have copied the command and error message below.

vif(model)
Error in vif.default(model) : there are aliased coefficients in the model

This error message means I have perfect multicollinearity. My binary variables are causing it: FOR_PROFIT, GOVERNMENT, and NON_PROFIT always sum to 1, so together with the intercept one of them is redundant (the “dummy variable trap”). I’m going to try creating a new model that leaves out the binary variable representing government nursing homes, as this category has the smallest number of nursing homes.
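
A quick sketch of how I could confirm that the three dummies always add up to 1:

#The three ownership dummies should sum to 1 in (almost) every row
table(nursing.home.clean$FOR_PROFIT + nursing.home.clean$GOVERNMENT + nursing.home.clean$NON_PROFIT, useNA = "ifany")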

model.1<-lm(`Total Weighted Health Survey Score`~`Number of Certified Beds`+`Total nursing staff turnover`+FOR_PROFIT+NON_PROFIT,data=nursing.home.clean)
vif(model.1)
##     `Number of Certified Beds` `Total nursing staff turnover` 
##                       1.024634                       1.033395 
##                     FOR_PROFIT                     NON_PROFIT 
##                       3.374509                       3.386151

These VIF values are very low; this model meets the assumption of no multicollinearity. However, my model with all three categories (for-profit, non-profit, and government) does not meet this assumption.

Discussion

My model does not meet the assumptions required for a linear model to be valid.

  1. Linearity of variables: the rainbow test had a very small p-value; my data is not linear.
  2. Independence of errors: in the Durbin-Watson test, my data had a p-value of 0. The errors are not independent.
  3. Homoscedasticity: the Breusch-Pagan test produced a very small p-value; my model is heteroscedastic.
  4. Normality of residuals: the Kolmogorov-Smirnov test had a p-value under .05; the residuals of my model are not normal.
  5. No multicollinearity: my original model that included for-profit, non-profit, and government nursing homes could not be run through the vif() command because there were ‘aliased coefficients’ in the model. When I created a new linear model with the government variable dropped, the new model met the assumption of no multicollinearity.

In summary, my original model violated all the assumptions of a linear model.

Mitigation

My mitigation will involve two steps: first, I need to drop the government binary variable from my model.

#I'll remove the government binary variable from my data set: 
nursing.home5 <- nursing.home4 |> select("Total Weighted Health Survey Score", "Number of Certified Beds", "Total nursing staff turnover", "FOR_PROFIT", "NON_PROFIT")

#I am removing NA values from the data. Total Weighted Health Survey Score has 56 missing values, and Total nursing staff turnover has 1105.

nursing.home.clean2 <- nursing.home5 |> filter(!is.na(`Total Weighted Health Survey Score`), !is.na(`Total nursing staff turnover`))

Next, I need to check whether my variables Total Weighted Health Survey Score, Number of Certified Beds, and Total nursing staff turnover have normal distributions.

hist(nursing.home.clean2$`Total Weighted Health Survey Score`)

hist(nursing.home.clean2$`Number of Certified Beds`)

hist(nursing.home.clean2$`Total nursing staff turnover`)

The Total Weighted Health Survey Score and Number of Certified Beds do not have normal distributions. Total nursing staff turnover has a fairly normal distribution. To make my data more normal, I will use a square root transformation on the Total Weighted Health Survey Score and Number of Certified Beds. From previous homework, I know that the Total Weighted Health Survey Score contains scores of 0; therefore, I cannot use a log transformation.
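
A quick sketch of how I could confirm those zero scores:

#Counting survey scores that are exactly 0; log() would return -Inf for these
sum(nursing.home.clean2$`Total Weighted Health Survey Score` == 0)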

nursing.home.transformed <- nursing.home.clean2 |> mutate(across(c(`Total Weighted Health Survey Score`, `Number of Certified Beds`), ~ sqrt(.x), .names = "{.col}_sqrt"))

Now let’s see how these variables look:

hist(nursing.home.transformed$`Total Weighted Health Survey Score_sqrt`)

hist(nursing.home.transformed$`Number of Certified Beds_sqrt`)

This is definitely much more normal. I want to do a quick check of whether this transformed data meets all the assumptions of a linear model.

#I am going to create another dataframe that drops the non-transformed Total Health Survey Score and Number of Certified Beds.
nursing.home.transformed2 <- nursing.home.transformed |> select(FOR_PROFIT, NON_PROFIT, "Total nursing staff turnover", "Number of Certified Beds_sqrt", "Total Weighted Health Survey Score_sqrt")

#Next, I'll create a new linear model:
model.3 <- lm(`Total Weighted Health Survey Score_sqrt`~`Total nursing staff turnover`+FOR_PROFIT+NON_PROFIT+`Number of Certified Beds_sqrt`, data=nursing.home.transformed2)

A lightning round of tests:

#Linearity of Variables
raintest(model.3)
## 
##  Rainbow test
## 
## data:  model.3
## Rain = 1.2836, df1 = 6820, df2 = 6814, p-value < 2.2e-16

Still not linear.

#Independence of Errors
durbinWatsonTest(model.3)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1578879      1.684203       0
##  Alternative hypothesis: rho != 0

The errors are not independent.

#Homoscedasticity
bptest(model.3)
## 
##  studentized Breusch-Pagan test
## 
## data:  model.3
## BP = 315.43, df = 4, p-value < 2.2e-16

The model is heteroscedastic.

#Normality of Residuals
ks.test(model.3$residuals, "pnorm")
## Warning in ks.test.default(model.3$residuals, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  model.3$residuals
## D = 0.31983, p-value < 2.2e-16
## alternative hypothesis: two-sided

The residuals still do not appear to be normal.

vif(model.3)
##  `Total nursing staff turnover`                      FOR_PROFIT 
##                        1.030244                        3.373695 
##                      NON_PROFIT `Number of Certified Beds_sqrt` 
##                        3.389208                        1.033277

Hooray! At least my model meets the assumption of no multicollinearity.

In conclusion, I have more work to do before my model meets the assumptions of a linear model.