Presentation of Data

This is a multiple linear regression analysis from data collected for the Study on the Efficacy of Nosocomial Infection Control (SENIC).

The goal of this analysis is to determine which factors in this dataset can help to better understand and predict the rates of nosocomial (hospital acquired) infection in US hospitals.

The data consists of a random sample of 113 hospitals selected from the original 338 hospitals surveyed. Each line of the data set has an identification number and provides information on 11 other variables for a single hospital.

The 12 variables are:

The response variable in this dataset is infectionRisk

Here you can see a basic representation of this variable in the form of a boxplot.

The remaining variables, excluding id, will be used initially as the independent variables.

A multiple linear regression model was built based on the variables that proved to have the best predictive ability on this measure of infection risk. This submission will describe how this model was chosen.

The alpha value used throughout this analysis will be 0.05

Data Visualization

My first step was to create some basic graphs of the data, starting with boxplots of each numeric variable, then plots of each variable against the response.

There were no red flags among those plots, other than the possibility of outliers in avgDailyCensus and stay, which will be further investigated with Cook’s Distance later.

Checking Assumptions

Here is a histogram of the response variable. It appears normally distributed.

Here, the Shapiro-Wilkes test provides further evidence that the response variable is normally distributed with a p-value of 0.1339.

## 
##  Shapiro-Wilk normality test
## 
## data:  response
## W = 0.98204, p-value = 0.1339

Here, the pairs plot is used to get an idea of multicollinearity among the numeric variables. It looks like there is some risk from the variables in the bottom right. That will be evaluated further with the variance inflation factor (VIF) analysis.

Here are residual plots from a linear model using all of the independent variables. There is no discernible pattern in the residuals plotted against fitted values and the QQ plot looks fairly well fitted to the line. I am confident in a constant variance and normality for the residuals. The Shapiro Wilkes test was also conducted on the residuals and they are normally distributed with a p-value of 0.4753.

## 
##  Shapiro-Wilk normality test
## 
## data:  fullmodel$residuals
## W = 0.98876, p-value = 0.4753

Assumptions for the residuals appear to be met and we can move forward.

Next are the observations with the highest Cook’s Distance scores. I chose to use 8/n as my maximum acceptable Cook’s D score.

##     Cooks Distance Score
## 8              0.1569191
## 53             0.1425955
## 112            0.1717169

These three observations (8, 53, and 112) are the most influential data points. My acceptance criteria for the Cook’s D score was fairly relaxed and yet these observations are still beyond that boundary. Therefore, I will remove these from the data for my regression analysis.

The next test for assumptions is the VIF test mentioned earlier.

##                VIF Score
## numBeds        34.683608
## avgDailyCensus 36.668432
## numNurses       7.282327

In this case, I will only remove the variables with the top two highest VIF scores, numBeds and avgDailyCensus.

Model Selection

For this analysis, I chose to use the AIC method of stepwise selection.

Here is the resulting selection for the model with only main effects.

## lm(formula = response ~ cultRatio + stay + facilities + region + 
##     chestXrayRatio + medSchool)

And here is the resulting selection for the model once interaction effects have been included.

## lm(formula = response ~ stay + cultRatio + chestXrayRatio + factor(region) + 
##     facilities + cultRatio:facilities)

Here, I am using a pairs plot with the remaining numeric variables to check if the multicollinearity witnessed earlier has been addressed.

The pairs plot looks much better. Here are some of the earlier plots, this time with data from the selected model.

## 
##  Shapiro-Wilk normality test
## 
## data:  model2$residuals
## W = 0.98944, p-value = 0.531

The QQ plot appears a little concerning near the top, but the Shapiro Wilkes test confirms normality for the residuals with a p-value of 0.531.

Model Analysis

This is the model selected for analysis, with variable names reworded slightly for clarity.

InfectionRisk = \(\beta\)0 + \(\beta\)1Stay + \(\beta\)2CultureRatio + \(\beta\)3ChestXrayRatio + \(\beta\)4Region2 + \(\beta\)5Region3 + \(\beta\)6Region4 + \(\beta\)7Facilities + \(\beta\)8CultureRatio:Facilities + \(\epsilon\)

\(\epsilon\)~N(0,\(\sigma\)2) ind

As a reminder, here is how the regions are coded. Here, the baseline is region 1 for NE.

  • region: Geographic region

    1 = NE

    2 = NC

    3 = S

    4 = W

Here are the sum of squares for this model, with each one showing the additional sum of squares added when that particular variable is added to the model. Please note, the order of variables is relevant here.

##                      Sum Sq
## stay                 57.305
## cultRatio            33.397
## chestXrayRatio        3.857
## factor(region)        9.166
## facilities            8.843
## cultRatio:facilities  7.379
## Residuals            81.433

Here are the estimates of the \(\beta\) values in the formula above, from \(\beta\)0, which is the intercept, to \(\beta\)8 for the estimate of the CultureRatio:Facilities interaction.

Below that, you will see the adjusted R-Square, which is moderately high, showing that the model explains 56.45 % of variation in the infectionRisk variable.

##                         Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)          -1.89108553 0.7029276177 -2.6902991 8.318457e-03
## stay                  0.26404571 0.0558267474  4.7297349 7.099122e-06
## cultRatio             0.11780497 0.0241895516  4.8700765 4.001523e-06
## chestXrayRatio        0.01071084 0.0050643439  2.1149513 3.682323e-02
## factor(region)2       0.18206500 0.2445098014  0.7446123 4.581854e-01
## factor(region)3       0.23216382 0.2510009500  0.9249519 3.571322e-01
## factor(region)4       0.92195699 0.3204619575  2.8769624 4.872690e-03
## facilities            0.04237114 0.0094054936  4.5049358 1.742654e-05
## cultRatio:facilities -0.00158171 0.0005152602 -3.0697302 2.733190e-03
## [1] 0.5645178

Conclusions

We started with 10 variables in the original dataset (not including Hospital ID numbers and the response variable itself).

Here are the significant effects:

stay: Average length of stay of all patients in hospital (in days)

  • The length of the average patient’s stay does have a positive linear relationship with the risk of infection. On average, the rate of infection will be 0.26 points higher for every additional day increase in the average number of days patient stay in that hospital.

chestXrayRatio: Routine chest x-ray ratio, which is the ratio of number of x-rays performed to the number of patients without signs or symptoms of pneumonia, times 100.

  • The ratio of routine chest x-rays does have a positive linear relationship with the risk of infection. On average, the rate of infection will be 0.01 points higher for every point increase in the ratio of x-rays.

region: Geographic region, where 1 = NE, 2 = NC, 3 = S, 4 = W.

  • The geographic region does have a positive linear relationship with the risk of infection. On average, the rate of infection will be 0.18 points higher for NC hospitals than NE, 0.23 points higher for S hospitals than NE, and 0.92 points higher for W hospitals than NE.

Interaction Effects:

  • cultRatio: Routine culturing ratio, which is the ratio of number of cultures performed to number of patient without signs or symptoms of hospital acquired infection, times 100.

  • facilities: Available facilities and services, which is percent of 35 potential facilities and services that are provided by the hospital.

The culturing ratio and available facilities percentage have an interaction effect, illustrated here. As you can see, any time we talk about the effect of one of these variables, it would be dependent on the other.

CONCLUSION SUMMARY: According to this analysis, the most effective ways to mitigate infection risk would be to reduce patients’ stay, minimize unnecessary x-rays, and investigate the imbalance among the 4 regions (which may warrant another analysis to determine what is being done differently among the regions).

Code:

#IN THIS PROGRAM, I HAVE SOUGHT TO MINIMIZE HARD CODING OF A LINEAR REGRESSION ANALYSIS.
#ONLY THE LINES THAT ARE INDENTED REQUIRE HARD CODING
#PLEASE NOTE THAT THE VISUALIZATION PLOTS ARE SET TO SAVE IN THE WORKING DIRECTORY AS .JPG

####Reading in the data
rm(list=ls())
library(jpeg)
library(car)
library(sjPlot)
    setwd("C:/Users/chris/OneDrive/Desktop/School Docs/STAT 5543/Final Project")
    file<-read.csv("SENIC.csv",header=T)
attach(file)
    response<-infectionRisk
    resp_name<-"Infection Risk"
#data cleaning
    file$medSchool<-as.factor(medSchool)
    file$region<-as.factor(region)
    file$id<-as.character(id)
####Data Visualization####
str(file)
head(file)
#pulling out just numeric columns
num_vars<-sapply(file,is.numeric)
num_only<-file[,num_vars]
str(num_only)
#Creating and saving boxplots for all of the continuous variables
for(i in 1:length(num_only)){
k<-names(num_only[i])
jpeg(paste0("Boxplot_",k,".jpg"))
boxplot(num_only[i])
dev.off()
}
#Creating and saving plots for all variables against response variable
#This will also include a plot of the response variable against itself, which can be discarded
for(i in 1:length(file)){
k<-names(file[i])
jpeg(paste0("Plot_",k,".jpg"))
plot(file[,i],response,xlab=k,ylab=resp_name)
dev.off()
}
####Checking Assumptions####
#Checking normality of response variable
hist(response,main=NA,xlab=resp_name)
shapiro.test(response) #null hypothesis for shapiro.test is that data is normal
#Pairs Plot
jpeg("Pairs_Plot.jpg")
pairs(num_only)
dev.off()
#Creating a subset of data without the response and any other ancillary variables
#Leaving only the independent variables I want in the model
str(file)
    ind_var<-file[-c(1,4)]
#Creating the lm model without hard coding; FULL MODEL/NO INTERACTIONS YET
x<-as.formula(paste("response~",paste0(names(ind_var),collapse="+")))
fullmodel<-lm(x)
#Saving residuals plot; FULL MODEL/NO INTERACTIONS YET
jpeg("FullModel_resplot.jpg")
plot(fullmodel$fitted.values,fullmodel$residuals,xlab="Fitted Values",ylab="Residuals",main="Residuals for Full Model")
dev.off()
#Saving QQ plot; FULL MODEL/NO INTERACTIONS YET
jpeg("FullModel_qqplot.jpg")
qqnorm(fullmodel$residuals)
qqline(fullmodel$residuals)
dev.off()
#Checking normality of residuals
shapiro.test(fullmodel$residuals)
#Cook's Distance for influential data points; I chose 8/n for the maximum allowable Cook's Distance score
max_cd<-8/length(response)
infl_test<-cooks.distance(fullmodel)>max_cd
cd<-cooks.distance(fullmodel)
infl<-data.frame(cd[infl_test])
names(infl)<-"Cooks Distance Score"
#The object named infl is now a dataframe with the highest influential data points
#The row numbers from the original dataset will be maintained, in case the decision is made to remove them
#VIF: checking for multicollinearity
vif<-vif(fullmodel)
vif_test<-vif(fullmodel)>5
VIF_High<-data.frame(vif[vif_test])
names(VIF_High)<-"VIF Score"
#The object named VIF_High is now a dataframe with all of the variables with potential multicollinearity in this model
#The variable names will be maintained, in case the decision is made to remove them
####Prepping Data for Modeling####
#Here I use the outliers and VIF_High to make decisions about modifying the data set
infl
VIF_High
#Removing rows if necessary
    file2<-file[-c(8,53,112), ]
#Removing columns if necessary
    file2<-subset(file2,select=-numBeds)
    file2<-subset(file2,select=-avgDailyCensus)
#Once again, creating a subset of data without the response and any other ancillary variables
#Leaving only the independent variables I want in the model
str(file2)
    ind_var2<-file2[-c(1,4)]
####Stepwise selection####
fit0<-lm(response~1)
x<-as.formula(paste("response~",paste0(names(ind_var2),collapse="+")))
stepmodel1<-step(fit0,x,direction='both',trace=0)
#Creating new ind_var to remove main effects not used in model1, to ensure interaction model does not include them
stepmodel1$call
str(ind_var2)
  ind_var3<-ind_var2[-c(2,7)]
x<-as.formula(paste("response~",paste0(names(ind_var3),collapse="+")))
model1<-lm(x)
#The object model1 will give the results from running the model chosen by the step function/ NO INTERACTIONS YET
stepmodel2<-step(model1,scope=.~.^2,direction="both",trace=0)
model2<-lm(as.formula(stepmodel2$call))
#The object model2 will give the results from running the model chosen by the step function/ INTERACTIONS INCLUDED
####Analyzing Chosen Model####
#Same plots used earlier, this time with the chosen model, to check if violated assumptions have been addressed
#For the pairs plot, removed factor variables
str(ind_var3)
    y<-ind_var3[-c(4,5)]
jpeg("FINAL.Pairs_Plot.jpg")
pairs(y)
dev.off()
jpeg("FINAL_resplot.jpg")
plot(model2$fitted.values,model2$residuals,xlab="Fitted Values",ylab="Residuals",main="Residuals for Final Model")
dev.off()
#Testing the normality of residuals on the final plot
jpeg("FINAL_qqplot.jpg")
qqnorm(model2$residuals)
qqline(model2$residuals)
dev.off()
shapiro.test(model2$residuals)
#Here I am showing which variables are factors from the remaining variables. If they are in the final model, you
#can consider rerunning the model with the factor variable(s) coded as such in the model statement
z<-vector()
for(i in 1:length(ind_var3)){
x<-class(ind_var3[[i]])
y<-x=="factor"
z<-c(z,y)
}
names(ind_var3)[z] #All the factor variables
stepmodel2$call #Calling up the model statement so you can check which factors are present
#At this point, copy and paste the model from the console and reclassify the factors if desired
    model3<-lm(formula = response ~ stay + cultRatio + chestXrayRatio + factor(region) + 
    facilities + cultRatio:facilities)
#Model diagnostics
model3sum<-summary(model3)
model3sum
anova(model3)
model3sum$adj.r.squared