Data Analysis Exam 1

Brandon W. Lee

October 13, 2016

1. Introduction

In response to the dangerous emergence of nosocomial infections among US hospitals in 1974, the Center for Disease Control undertook a nationwide study to evaluate approaches to infection control. Researchers collected data from a nationally representative sample of hospitals, hoping to identify the approaches to infection control that are most cost-effective for hospitals and to point out specific questions for future research. In this project, we examine the link between the average length of stay of patients in the hospital and the risk of infection.

2. Exploratory Data Analysis

To examine this idea, researchers gathered data from 113 randomly selected hospitals across the nation. The details gathered from each hospital include: patients’ average stay in the hospital, patients’ average age, patients’ infection risk, hospitals’ routine culturing ratio, hospitals’ routine chest X-ray ratio, hospitals’ number of beds, hospitals’ number of medical schools, hospitals’ region number, hospitals’ average daily census, hospitals’ number of nurses, and hospitals’ available number of facilities.

For our data analysis, the two variables of interest are the average length of stay of patients in the hospital and the risk of infection. Figure 1a below shows histograms of the distribution of each variable. The response variable is the patients’ average length of stay in the hospital, which ranges from 6.70 to 19.56 days, with a mean of 9.65 days and a median of 9.42 days. Its distribution is unimodal, centered around 9 days, and right-skewed with a short tail. The predictor variable is the hospitals’ infection risk, which ranges from 1.3% to 7.8%, with a mean of 4.35% and a median of 4.4%. Its distribution is unimodal, centered around 4.4%, and slightly left-skewed with a short tail.

Table 1: Summary of Predictor and Response Variables

	Minimum	Maximum	Mean	Median	Std. Dev
Length of Stay	6.70	19.56	9.65	9.42	1.91
Infection Risk	1.3	7.8	4.35	4.4	1.34

Figure 1a) Histograms of the distributions of the predictor and response variables

When exploring the bivariate relationship between Infection Risk and Length of Stay, we start with a simple linear regression model, shown in Figure 1b. There appear to be some outliers at the extreme values of Infection Risk, and the residual plot in Figure 1c shows more clearly that a non-linear relationship may be more appropriate for this raw set of data.

Figure 1b) Scatterplot with simple linear regression model; c) Scatterplot of Predictor vs. Residuals

3. Modeling and Diagnostics

Fitting a simple linear regression model with a normal error assumption yields the parameter estimates displayed in Table 2; the fitted line is shown in Figure 1b. In Figure 2, we create diagnostic plots to check the model assumptions. The first plot shows the residuals against the predicted values of the linear model. While there are some extreme residual values, a slightly curved shape, and non-constant variance, we do not have enough clear evidence to claim a non-linear relationship. The normal Q-Q plot allows us to check the normal error assumption: for it to hold, the residuals should lie close to the red reference line. However, there are clear deviations from the line in both tails, implying that the errors are not approximately normal. The plot on the right shows the Box-Cox procedure, which suggests that a Box-Cox transformation of the response variable with λ = -1 would result in a better-fitting model.

When comparing the maximum Cook’s Distance to an F distribution with 2 and n - 2 = 111 degrees of freedom, we see that the value is not very large. Specifically, a value of 0.3545018 is not large enough to flag the corresponding observation as highly influential on our linear model, so we decide not to remove any data points from our dataset.
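The comparison above can be sketched numerically. A common rule of thumb (an assumption here, since the report does not state its cutoff) is to locate Cook’s Distance within the F distribution with 2 and 111 degrees of freedom and to treat values near or above the 50th percentile as highly influential. With 2 numerator degrees of freedom the F CDF has a closed form, so no statistics library is needed; the function name `f_cdf_2_num_df` is hypothetical.

```python
def f_cdf_2_num_df(x: float, den_df: float) -> float:
    """CDF of an F distribution with 2 numerator and den_df denominator df.

    For 2 numerator df the CDF reduces to 1 - (1 + 2x/den_df) ** (-den_df / 2).
    """
    if x <= 0:
        return 0.0
    return 1.0 - (1.0 + 2.0 * x / den_df) ** (-den_df / 2.0)


max_cooks_distance = 0.3545018   # value reported in the text
n = 113                          # hospitals in the sample
den_df = n - 2                   # 111 denominator degrees of freedom

percentile = f_cdf_2_num_df(max_cooks_distance, den_df)
print(f"F(2, {den_df}) percentile of max Cook's Distance: {percentile:.3f}")

# Assumed rule of thumb: flag the point only near or above the median (0.5).
is_highly_influential = percentile >= 0.5
print("Highly influential:", is_highly_influential)
```

Under this assumed cutoff, the largest Cook’s Distance falls at roughly the 30th percentile of the reference distribution, below the median, which is in line with the decision to keep all observations.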

Table 2: Simple Linear Regression before Transformation

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	6.3368	0.5213	12.156	< 2e-16
Infection Risk	0.7604	0.1144	6.645	1.18e-09

Figure 2a) Scatterplot of Predictor vs. Residuals; b) Normal Q-Q Plot of Residuals; c) Box-Cox Transformation

After applying the Box-Cox transformation with λ = -1 to the response variable, we obtain the linear regression parameter estimates displayed in Table 3. Figure 3 shows that this linear regression model fits the data better. Specifically, the scatter on both sides of the regression line appears more balanced, the plot of residuals against the predicted values of the linear model appears more random, and the residuals lie closer to the red reference line in the normal Q-Q plot.

Table 3: Simple Linear Regression after Transformation

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	0.861379	0.004831	178.319	< 2e-16
Infection Risk	0.007268	0.001061	6.853	4.25e-10

Figure 3a) Scatterplot of Predictor vs. Response; b) Scatterplot of Predictor vs. Residuals; c) Normal Q-Q Plot of Residuals

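The model fit after the Box-Cox transformation yields an intercept estimate (\(B_0\)) of 0.8613789 and a slope estimate (\(B_1\)) of 0.007268. Writing \(\hat{Y_i}\) for the predicted value of the transformed Length of Stay and \(X_i\) for the Infection Risk, the final linear regression model is: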

\(\hat{Y_i}\) = 0.8613789 + 0.007268 \(X_i\)
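To read the fitted model in days, the prediction has to be back-transformed. The sketch below assumes the standard Box-Cox form \((Y^\lambda - 1)/\lambda\), which for \(\lambda = -1\) becomes \(1 - 1/Y\); the report does not state the convention, but this form is consistent with the size of the intercept in Table 3. The function names are hypothetical.

```python
B0 = 0.8613789   # intercept from Table 3
B1 = 0.007268    # slope from Table 3
LAMBDA = -1.0    # Box-Cox parameter chosen in the text


def box_cox(y: float, lam: float = LAMBDA) -> float:
    """Assumed Box-Cox form (y**lam - 1) / lam; equals 1 - 1/y for lam = -1."""
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z: float, lam: float = LAMBDA) -> float:
    """Invert the assumed Box-Cox form to return to the original scale (days)."""
    return (lam * z + 1.0) ** (1.0 / lam)


def predict_days(infection_risk: float) -> float:
    """Predicted length of stay in days for a given infection risk (percent)."""
    transformed = B0 + B1 * infection_risk
    return inverse_box_cox(transformed)


for risk in (4.0, 5.0):
    print(f"Infection risk {risk:.0f}% -> {predict_days(risk):.2f} days")
```

At an infection risk of 5% this gives a prediction of about 9.78 days, which falls inside the interval reported in Table 4.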

The assumptions that underlie this final linear regression model concern the error terms \(e_i\):

E[\(e_i\)] = 0

Var[\(e_i\)] = \(\sigma^2\)

\(e_i\) uncorrelated

\(e_i\) normally distributed

These assumptions appear reasonable for the final model: Figure 3 shows that the model is a good fit for the data, the residuals (\(\hat{e_i}\)) are approximately normally distributed, and the plot of residuals against the predictor values looks random, which supports a linear relationship.

4. Inference and Results

According to the summary of our new linear regression model, the p-value for the slope is \(4.248 \times 10^{-10}\). A hypothesis test of \(H_0: B_1 = 0\) results in the rejection of the null hypothesis, since the p-value is less than the significance level (α = 0.05). Therefore, there is strong statistical evidence of a positive linear relationship between the predictor and the transformed response. Specifically, we estimate that for each one-percentage-point increase in a hospital’s infection risk, the Box-Cox transformed length of stay increases by 0.007268 on average.
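The test statistic behind this conclusion can be recomputed from Table 3. The sketch below divides the slope estimate by its standard error and compares the result with a two-sided 5% critical value of about 1.98 for 111 degrees of freedom; that critical value is an approximation taken from standard t tables, not a figure from the report. Because the table entries are rounded, the ratio differs slightly from the reported t value of 6.853.

```python
slope_estimate = 0.007268   # Table 3
slope_std_error = 0.001061  # Table 3
n = 113                     # hospitals in the sample
degrees_of_freedom = n - 2  # simple linear regression estimates 2 parameters

# Approximate two-sided 5% critical value for t with 111 df (assumed from
# standard t tables; not reported in the text).
T_CRITICAL = 1.98

t_statistic = slope_estimate / slope_std_error
reject_null = abs(t_statistic) > T_CRITICAL

print(f"t = {t_statistic:.2f} on {degrees_of_freedom} df")
print("Reject H0 (slope = 0):", reject_null)
```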

From Table 4, we are 95% confident that the mean Length of Stay for patients at hospitals with an Infection Risk of 5% is between 9.49 and 10.08 days.

Table 4: 95% Confidence Interval for \({X_i} = 5\)

	lower	upper
Length of Stay (days)	9.49	10.08
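Because the Box-Cox transformation is monotone increasing for positive responses, interval endpoints map one-to-one between the transformed scale and days. The sketch below, assuming the \((Y^\lambda - 1)/\lambda\) form with \(\lambda = -1\), maps the Table 4 endpoints to the transformed scale and checks that the fitted value at \(X_i = 5\) sits near the midpoint, as expected for a symmetric t-based interval. Variable names are illustrative.

```python
B0 = 0.8613789   # intercept from Table 3
B1 = 0.007268    # slope from Table 3


def to_transformed(days: float) -> float:
    """Assumed Box-Cox transform with lambda = -1: (y**-1 - 1) / -1 = 1 - 1/y."""
    return 1.0 - 1.0 / days


lower_days, upper_days = 9.49, 10.08   # Table 4 interval, in days
lower_z = to_transformed(lower_days)
upper_z = to_transformed(upper_days)

fitted_z = B0 + B1 * 5.0               # fitted value at infection risk 5%
midpoint_z = (lower_z + upper_z) / 2.0
half_width_z = (upper_z - lower_z) / 2.0

print(f"Transformed interval: ({lower_z:.4f}, {upper_z:.4f})")
print(f"Midpoint {midpoint_z:.4f} vs fitted value {fitted_z:.4f}")
print(f"Half-width on transformed scale: {half_width_z:.4f}")
```

The interval is symmetric on the transformed scale and becomes slightly asymmetric once expressed in days, which is the usual consequence of back-transforming.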

5. Discussion

There is enough evidence to support our hypothesis that there is a positive linear relationship between the Infection Risk of hospitals and the (transformed) Length of Stay for patients. Nevertheless, correlation does not imply causation. Further analysis is needed on the relationship between Length of Stay and Age, as well as the many other predictor variables that could be related to our response variable. For future research on the relationship between Infection Risk and Length of Stay, we should collect more data around what initially appeared to be outliers to determine whether they are genuine.