First, I will load in helpful packages, set a working directory, and load my dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(dplyr)
setwd("C:/Users/andie/Downloads")
NH_ProviderInfo_Jul2025 <- read_excel("NH_ProviderInfo_Jul2025.xlsx")

I will begin this homework by selecting data I am interested in exploring further.

nursing.home <- NH_ProviderInfo_Jul2025 |>
  select("Total Weighted Health Survey Score", "Ownership Type", "Number of Certified Beds", "Total nursing staff turnover")

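One of these variables, Ownership Type, is categorical, not numerical. Its categories are the thirteen sub-types of for-profit, government, and non-profit ownership that appear in the recoding step below.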
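One way to list the categories and see how often each occurs is to tabulate the column. The sketch below uses a small made-up data frame (`toy`, not the CMS file) and only base R so that it runs on its own; the same `table()` and `unique()` calls should work on the "Ownership Type" column of nursing.home.

```r
# Toy stand-in for the real data (hypothetical values, for illustration only)
toy <- data.frame(
  `Ownership Type` = c(
    "For profit - Corporation",
    "Non profit - Other",
    "For profit - Corporation",
    "Government - County"
  ),
  check.names = FALSE  # keep the space in the column name
)

# Count how many rows fall into each category
ownership_counts <- table(toy[["Ownership Type"]])
print(ownership_counts)

# The distinct category labels, sorted alphabetically
ownership_levels <- sort(unique(toy[["Ownership Type"]]))
print(ownership_levels)
```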

For ease of analysis, I will consolidate the sub-categories into three main categories: for profit, government, and non-profit.

nursing.home2 <- nursing.home |>
  mutate(
    Ownership_Type_Num = case_when(
      # 1 = for profit
      `Ownership Type` %in% c(
        "For profit - Corporation",
        "For profit - Individual",
        "For profit - Limited Liability company",
        "For profit - Partnership"
      ) ~ 1,
      # 2 = government
      `Ownership Type` %in% c(
        "Government - City",
        "Government - City/county",
        "Government - County",
        "Government - Federal",
        "Government - Hospital district",
        "Government - State"
      ) ~ 2,
      # 3 = non-profit
      `Ownership Type` %in% c(
        "Non profit - Church related",
        "Non profit - Corporation",
        "Non profit - Other"
      ) ~ 3,
      # anything else becomes NA
      TRUE ~ NA_real_
    )
  )
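Because every label in the recoding above follows a "Group - Subtype" pattern, the same consolidation could also be sketched by keeping only the text before " - ". This is an alternative illustration on made-up labels, not the method used in this homework; the names `ownership`, `ownership_group`, and `group_codes` are hypothetical, and a label that does not match one of the three groups would come out as NA.

```r
# Toy ownership labels (hypothetical rows; same "Group - Subtype" pattern as the data)
ownership <- c(
  "For profit - Limited Liability company",
  "Government - City/county",
  "Non profit - Church related",
  "For profit - Individual"
)

# Drop everything from " - " onward, leaving only the main group
ownership_group <- sub(" - .*$", "", ownership)
print(ownership_group)

# Map each main group to the same codes used above (1, 2, 3)
group_codes <- c("For profit" = 1, "Government" = 2, "Non profit" = 3)
ownership_num <- unname(group_codes[ownership_group])
print(ownership_num)
```

Listing every sub-category explicitly, as the homework does, is longer but makes it obvious which labels were expected.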

My dataset now has 5 variables: the four I selected plus Ownership_Type_Num, where 1 represents for-profit nursing homes, 2 represents government nursing homes, and 3 represents non-profit nursing homes. However, I am not able to run a correlation test yet because the assigned numerical values are arbitrary labels: their order and spacing carry no meaning. I need to convert the variable into binary (0/1) variables to run correlation tests.

nursing.home3 <- nursing.home2 |>
  mutate(
    FOR_PROFIT = ifelse(Ownership_Type_Num == 1, 1, 0),
    GOVERNMENT = ifelse(Ownership_Type_Num == 2, 1, 0),
    NON_PROFIT = ifelse(Ownership_Type_Num == 3, 1, 0)
  )
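To see why arbitrary codes cannot go straight into a correlation, the sketch below numbers the same three groups in two different ways and compares the results. The data are made up for illustration and all object names are hypothetical.

```r
# Toy data (hypothetical values): an outcome and an ownership group per home
score <- c(10, 12, 30, 28, 20, 22)
group <- c("For profit", "For profit", "Government",
           "Government", "Non profit", "Non profit")

# Two equally arbitrary numberings of the same three groups
codes_a <- c("For profit" = 1, "Government" = 2, "Non profit" = 3)
codes_b <- c("For profit" = 1, "Non profit" = 2, "Government" = 3)

cor_a <- cor(score, codes_a[group])
cor_b <- cor(score, codes_b[group])
print(c(cor_a, cor_b))  # the two values differ even though the data are identical

# A 0/1 indicator does not depend on any numbering of the groups
for_profit <- ifelse(group == "For profit", 1, 0)
cor_dummy <- cor(score, for_profit)
print(cor_dummy)
```

The correlation with the coded variable changes when the labels are renumbered, while the correlation with a 0/1 indicator has a fixed meaning: it compares homes in that group with all other homes.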

Getting close! I need to remove the original “Ownership Type” variable from the dataset because it is not numeric. I can also remove the “Ownership_Type_Num” since the data has been turned into binary variables.

nursing.home4 <- nursing.home3 |> select("Total Weighted Health Survey Score", "Number of Certified Beds", "Total nursing staff turnover", "FOR_PROFIT", "GOVERNMENT", "NON_PROFIT")

As a final clean-up step, I need to find and remove NA values. The summary shows which variables contain them.

summary(nursing.home4)
##  Total Weighted Health Survey Score Number of Certified Beds
##  Min.   :   0.00                    Min.   :  2.0           
##  1st Qu.:  28.00                    1st Qu.: 66.0           
##  Median :  56.00                    Median :100.0           
##  Mean   :  85.83                    Mean   :106.8           
##  3rd Qu.: 108.00                    3rd Qu.:127.0           
##  Max.   :1723.25                    Max.   :843.0           
##  NA's   :56                                                 
##  Total nursing staff turnover   FOR_PROFIT       GOVERNMENT     
##  Min.   :  2.60               Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 36.70               1st Qu.:0.0000   1st Qu.:0.00000  
##  Median : 46.30               Median :1.0000   Median :0.00000  
##  Mean   : 46.87               Mean   :0.7332   Mean   :0.06514  
##  3rd Qu.: 56.50               3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :100.00               Max.   :1.0000   Max.   :1.00000  
##  NA's   :1105                                                   
##    NON_PROFIT    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2017  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
## 

Total Weighted Health Survey Score has 56 missing values, and Total nursing staff turnover has 1105.

nursing.home.clean <- nursing.home4 |> filter(!is.na(`Total Weighted Health Survey Score`), !is.na(`Total nursing staff turnover`))
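A quick way to confirm the clean-up worked is to count missing values per column before and after. The sketch below does this on a small made-up data frame using base R; `toy` and its column names are hypothetical. Note that `complete.cases()` drops a row if any column is missing, which matches the filter above only because the summary shows NA values in just those two variables.

```r
# Toy data frame with missing values (hypothetical numbers)
toy <- data.frame(
  score    = c(28, NA, 56, 108, 40),
  turnover = c(36.7, 46.3, NA, 56.5, NA),
  beds     = c(66, 100, 127, 80, 95)
)

# Missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only rows with no missing value in any column
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))

# After cleaning, no column should contain NA
print(colSums(is.na(toy_clean)))
```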

Now I can begin the homework! Let’s check whether there are any obvious correlations using cor().

cor(nursing.home.clean)
##                                    Total Weighted Health Survey Score
## Total Weighted Health Survey Score                         1.00000000
## Number of Certified Beds                                   0.16406499
## Total nursing staff turnover                               0.23113375
## FOR_PROFIT                                                 0.16999498
## GOVERNMENT                                                -0.05709394
## NON_PROFIT                                                -0.15265859
##                                    Number of Certified Beds
## Total Weighted Health Survey Score               0.16406499
## Number of Certified Beds                         1.00000000
## Total nursing staff turnover                    -0.07218797
## FOR_PROFIT                                       0.09265042
## GOVERNMENT                                       0.03752292
## NON_PROFIT                                      -0.12480015
##                                    Total nursing staff turnover  FOR_PROFIT
## Total Weighted Health Survey Score                   0.23113375  0.16999498
## Number of Certified Beds                            -0.07218797  0.09265042
## Total nursing staff turnover                         1.00000000  0.15487169
## FOR_PROFIT                                           0.15487169  1.00000000
## GOVERNMENT                                          -0.04914286 -0.43493540
## NON_PROFIT                                          -0.14081803 -0.83797299
##                                     GOVERNMENT NON_PROFIT
## Total Weighted Health Survey Score -0.05709394 -0.1526586
## Number of Certified Beds            0.03752292 -0.1248001
## Total nursing staff turnover       -0.04914286 -0.1408180
## FOR_PROFIT                         -0.43493540 -0.8379730
## GOVERNMENT                          1.00000000 -0.1269283
## NON_PROFIT                         -0.12692834  1.0000000

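From the initial test, there don’t seem to be any strong correlations between the variables I selected. I am ignoring the correlations among the for-profit, government, and non-profit variables: every nursing home belongs to exactly one of the three groups, so these indicators are negatively correlated by construction and the numbers don’t tell me anything new.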
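The correlation between two mutually exclusive 0/1 indicators is fixed by the share of homes in each group. If a share p of homes is in one group and a share q is in another, the covariance is -pq and the variances are p(1-p) and q(1-q), so the correlation is -sqrt(pq / ((1-p)(1-q))). The sketch below checks this on made-up indicators; all names and values are hypothetical.

```r
# Toy indicators (hypothetical): 7 for-profit, 2 non-profit, 1 government home
for_profit <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
non_profit <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0)

p <- mean(for_profit)  # share of for-profit homes
q <- mean(non_profit)  # share of non-profit homes

# Correlation implied by the shares alone, for mutually exclusive indicators
implied <- -sqrt((p * q) / ((1 - p) * (1 - q)))

# Correlation computed from the data
observed <- cor(for_profit, non_profit)

print(c(implied = implied, observed = observed))
```

Plugging in the shares from the earlier summary (0.7332 for-profit, 0.2017 non-profit) gives roughly -0.83, close to the -0.838 in the matrix; the small gap is expected because those shares were computed before the NA rows were removed.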

Let’s see what this information looks like visually.

pairs(~`Total Weighted Health Survey Score`+`Number of Certified Beds`+`Total nursing staff turnover`+`FOR_PROFIT`+`GOVERNMENT`+`NON_PROFIT`, data=nursing.home.clean)

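Interesting! The scatterplots still don’t show any clear relationships. Perhaps this is because the data is not normally distributed: the summary above shows that Total Weighted Health Survey Score is strongly right-skewed (mean 85.83, median 56, maximum 1723.25), and Pearson’s correlation is sensitive to skew and outliers. I will look further into the correlation between total nursing staff turnover and total weighted health survey score, as they have the largest, albeit still small, Pearson correlation (0.231).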

cor.test(nursing.home.clean$`Total nursing staff turnover`, nursing.home.clean$`Total Weighted Health Survey Score`, method="kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  nursing.home.clean$`Total nursing staff turnover` and nursing.home.clean$`Total Weighted Health Survey Score`
## z = 32.198, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.1845374

I selected the Kendall method because I am not certain whether the data is normally distributed, and this method does not assume that it is. The professor also told the class it was his preferred method, which seems like a good enough reason to me!
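Kendall's tau depends only on the ordering of the values, which is why skew matters less for it than for Pearson's r. The sketch below illustrates this on made-up data (hypothetical `x` and `y`, no ties): taking the log of a skewed variable changes Pearson's r but leaves tau unchanged.

```r
# Toy data (hypothetical): a right-skewed x and a y that mostly rises with it
x <- c(1, 2, 4, 8, 16, 32, 64, 500)
y <- c(3, 1, 4, 6, 5, 9, 8, 10)

# Kendall's tau uses only the ordering of the values
tau_raw <- cor(x, y, method = "kendall")
tau_log <- cor(log(x), y, method = "kendall")

# Pearson's r uses the values themselves
r_raw <- cor(x, y, method = "pearson")
r_log <- cor(log(x), y, method = "pearson")

print(c(tau_raw = tau_raw, tau_log = tau_log))
print(c(r_raw = r_raw, r_log = r_log))
```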

The p-value is very small (< 2.2e-16); a tau this far from zero would be very unlikely to occur by random chance if the two variables were unrelated. I therefore reject the null hypothesis that there is no correlation between total nursing staff turnover and total weighted health survey score. The tau value of 0.1845 is positive but small, indicating a weak correlation between the variables; with a dataset this large, even a weak correlation can be highly statistically significant.