To start my homework, I need to prepare my data. I previously did these steps in Homework 5.

#Loading packages and setting working directory
library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("C:/Users/andie/Downloads")
NH_ProviderInfo_Jul2025 <- read_excel("NH_ProviderInfo_Jul2025.xlsx")

#Selecting data I am interested in exploring further
nursing.home<- NH_ProviderInfo_Jul2025 |> select("Total Weighted Health Survey Score", "Ownership Type", "Number of Certified Beds", "Total nursing staff turnover")

#Recoding "Ownership Type" from a categorical variable into a numeric code with three categories: 1 = for profit, 2 = government, 3 = nonprofit.
nursing.home2 <- nursing.home |> mutate(Ownership_Type_Num = case_when(
  `Ownership Type` %in% c(
    "For profit - Corporation",
    "For profit - Individual",
    "For profit - Limited Liability company",
    "For profit - Partnership"
  ) ~ 1,
  `Ownership Type` %in% c(
    "Government - City",
    "Government - City/county",
    "Government - County",
    "Government - Federal",
    "Government - Hospital district",
    "Government - State"
  ) ~ 2,
  `Ownership Type` %in% c(
    "Non profit - Church related",
    "Non profit - Corporation",
    "Non profit - Other"
  ) ~ 3,
  TRUE ~ NA_real_
))
#Now converting this numeric code into three binary (0/1) indicator variables, one per ownership category:
nursing.home3 <- nursing.home2 |>
  mutate(
    FOR_PROFIT = ifelse(Ownership_Type_Num == 1, 1, 0),
    GOVERNMENT = ifelse(Ownership_Type_Num == 2, 1, 0),
    NON_PROFIT = ifelse(Ownership_Type_Num == 3, 1, 0)
  )

#I'll tidy the data a bit by removing the "Ownership Type" and "Ownership_Type_Num" variables: 
nursing.home4 <- nursing.home3 |> select("Total Weighted Health Survey Score", "Number of Certified Beds", "Total nursing staff turnover", "FOR_PROFIT", "GOVERNMENT", "NON_PROFIT")

#As a final step, I am removing NA values from the data. Total Weighted Health Survey Score has 56 missing values, and Total nursing staff turnover has 1105.

nursing.home.clean <- nursing.home4 |> filter(!is.na(`Total Weighted Health Survey Score`), !is.na(`Total nursing staff turnover`))
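
As a rough sketch of how the missing-value counts above can be checked, the snippet below counts NA values per column and then drops incomplete rows. The data frame `toy` and its columns are made-up stand-ins for `nursing.home4`, not the real data, and the sketch uses base R rather than dplyr.

```r
# Toy data frame standing in for nursing.home4 (all values are made up)
toy <- data.frame(
  score    = c(10, NA, 35, 20, NA),
  turnover = c(40, 55, NA, 60, 45),
  beds     = c(100, 80, 120, 60, 90)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows where both score and turnover are present
toy_clean <- toy[!is.na(toy$score) & !is.na(toy$turnover), ]
print(nrow(toy_clean))
```

Running the same `colSums(is.na(...))` call on the real data frame should give the per-variable counts that the comment above reports.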

I will use the Total Weighted Health Survey Score as my dependent variable. Number of Certified Beds, Total nursing staff turnover, and ownership type (for profit, government, or nonprofit) will be my independent variables. Let’s create a linear model:

model<-lm(`Total Weighted Health Survey Score`~`Number of Certified Beds`+`Total nursing staff turnover`+FOR_PROFIT+GOVERNMENT+NON_PROFIT,data=nursing.home.clean)

Let’s see what this looks like!

summary(model)
## 
## Call:
## lm(formula = `Total Weighted Health Survey Score` ~ `Number of Certified Beds` + 
##     `Total nursing staff turnover` + FOR_PROFIT + GOVERNMENT + 
##     NON_PROFIT, data = nursing.home.clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -258.29  -51.56  -20.81   21.49 1388.52 
## 
## Coefficients: (1 not defined because of singularities)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -33.91353    3.26214 -10.396   <2e-16 ***
## `Number of Certified Beds`       0.27645    0.01343  20.582   <2e-16 ***
## `Total nursing staff turnover`   1.48410    0.05451  27.225   <2e-16 ***
## FOR_PROFIT                      26.27741    2.02841  12.955   <2e-16 ***
## GOVERNMENT                      -0.21018    3.62867  -0.058    0.954    
## NON_PROFIT                            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.51 on 13634 degrees of freedom
## Multiple R-squared:    0.1,  Adjusted R-squared:  0.09978 
## F-statistic: 378.9 on 4 and 13634 DF,  p-value: < 2.2e-16

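The overall p-value is very small (< 2.2e-16), meaning the model as a whole is statistically significant. The R-squared is around 10%, which means that most of the variation in Total Weighted Health Survey Score is not explained by the significant variables: the number of certified beds, total nursing staff turnover, and for-profit ownership status. Importantly, the p-value for the government ownership type is not statistically significant (0.954), and the coefficient for the non-profit ownership type couldn’t be calculated. R reports it as NA “because of singularities”: for every facility in the model the three ownership indicators add up to 1, so the third is redundant once the intercept and the other two are included, and non-profit ends up as the reference category.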
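
The NA row for NON_PROFIT can be reproduced on made-up data. The sketch below is only an illustration under assumptions: the variables `type` and `y` are hypothetical, and the group means are invented. It fits one model with all three indicators and one with the non-profit indicator left out.

```r
# Made-up data: 'type' and 'y' are hypothetical, not the CMS data
set.seed(1)
type <- rep(c("for_profit", "government", "non_profit"), each = 20)
y <- c(rnorm(20, mean = 50), rnorm(20, mean = 30), rnorm(20, mean = 25))

toy <- data.frame(
  y          = y,
  FOR_PROFIT = as.numeric(type == "for_profit"),
  GOVERNMENT = as.numeric(type == "government"),
  NON_PROFIT = as.numeric(type == "non_profit")
)

# All three indicators plus an intercept: the last indicator is redundant
full <- lm(y ~ FOR_PROFIT + GOVERNMENT + NON_PROFIT, data = toy)
print(coef(full))     # NON_PROFIT comes back as NA

# Leaving one indicator out makes non-profit the reference category
reduced <- lm(y ~ FOR_PROFIT + GOVERNMENT, data = toy)
print(coef(reduced))  # intercept = mean of the non-profit group
```

In both fits the FOR_PROFIT and GOVERNMENT coefficients are the same, so leaving NON_PROFIT out of the formula changes nothing except removing the NA row.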

Read literally, the model says that as for-profit ownership goes up by 1, the Total Weighted Health Survey Score goes up by 26.28. That sentence doesn’t make much sense in the real world, as for-profit ownership can’t increase by 1. Because the for-profit ownership variable is binary (0 or 1), I interpret this as for-profit facilities having a Total Weighted Health Survey Score about 26.28 points higher than otherwise similar non-profit facilities, the omitted category (higher scores indicate worse outcomes). As another example, each 1-unit increase in Total nursing staff turnover is associated with a 1.48-point increase in the Total Weighted Health Survey Score, holding the other variables constant.
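
To make the reading of a binary coefficient concrete, here is a small sketch on simulated data. Everything in it is an assumption for illustration: the variables `turnover`, `for_profit`, and `score` are hypothetical, and the numbers used to generate them are invented. It compares the model's predictions for two facilities that differ only in ownership.

```r
# Made-up data; 'turnover' and 'for_profit' are hypothetical stand-ins
set.seed(42)
n <- 200
toy <- data.frame(
  turnover   = runif(n, 20, 80),
  for_profit = rbinom(n, 1, 0.5)
)
toy$score <- 10 + 1.5 * toy$turnover + 25 * toy$for_profit + rnorm(n, sd = 5)

fit <- lm(score ~ turnover + for_profit, data = toy)

# Two facilities with the same turnover, differing only in ownership
pair <- data.frame(turnover = c(50, 50), for_profit = c(0, 1))
pred <- predict(fit, newdata = pair)

# The gap between the two predictions equals the indicator's coefficient
gap <- unname(pred[2] - pred[1])
print(gap)
print(coef(fit)[["for_profit"]])
```

The two printed numbers match, which is why the coefficient on a 0/1 variable reads as a difference between groups and not as a slope.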

Before taking this linear model as gospel, we should check whether the relationship between the predictors and the outcome is actually linear, using a residuals-versus-fitted plot:

plot(model, which=1)

Unsurprisingly, the relationship is not totally linear. The plot follows linearity somewhat closely, but there is a definite downward slope. I am not certain whether the relationship is linear enough, without additional data tidying, to draw conclusions from the model.
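
For reference, the quantities behind `plot(model, which=1)` can be pulled out by hand. The sketch below uses made-up, deliberately curved data (the variables `x` and `y` are hypothetical) and computes the fitted values, the residuals, and a smoothed trend similar to the line the diagnostic plot draws. The plot itself is only drawn in an interactive session.

```r
# Made-up data with a curved relationship (hypothetical, for illustration)
set.seed(7)
x <- runif(150, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(150)

fit <- lm(y ~ x)

# The two quantities shown on a residuals-versus-fitted plot
fit_vals <- fitted(fit)
res      <- resid(fit)

# Smoothed trend through the residuals; a flat line near zero suggests linearity
trend <- lowess(fit_vals, res)

if (interactive()) {
  plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  lines(trend, col = "red")
}
```

When the smoothed trend bends or slopes away from zero, as it does for this toy data, a straight-line model is missing some structure in the relationship.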