Exam I R Problem

Introduction

Data Import and Exploration

##    Days Infect
## 1  7.13    4.1
## 2  8.82    1.6
## 3  8.34    2.7
## 4  8.95    5.6
## 5 11.20    5.7
## 6  9.76    5.1

Today, we will be exploring a hospital dataset collected as part of the SENIC project. In this data exploration, we will attempt to build a model to predict a patient’s risk of infection based on the number of days they are hospitalized, based on a varying amount of days.

This is a potentially very important question to answer as it can help hospitals optimize treatment processes based on a patient’s proximity to the critical period of potential infection.

In order to explore our data, we will be using a linear regression model with hypothesis testing to confirm the veracity of our results.

Data Summary

##       Days            Infect     
##  Min.   : 6.700   Min.   :1.300  
##  1st Qu.: 8.355   1st Qu.:3.700  
##  Median : 9.350   Median :4.400  
##  Mean   : 9.572   Mean   :4.383  
##  3rd Qu.:10.445   3rd Qu.:5.150  
##  Max.   :19.560   Max.   :7.800

## [1] 1.737769

## [1] 1.29055

Here, we can see that the average number of days a patient spends in the hospital is 9.572 days while the average probability of a patient becoming infected is 4.383%.

The respective minimums are 6.7 days and 1.3%, while the maximums are 19.56 days and 7.8%. The median is 9.35 days spent at the hospital with a 4.4% probability of infection.

In addition, we can see that the standard deviation of Days is 1.73 days, while the standard deviation of Probability of Infection is 1.29%.

Now, lets explore the boxplots of each variable to see how these distributions are developed visually.

Boxplots

From the boxplot of Days Hospitalized, we can see that there are only two outliers: one patient admitted for 13 days and one patient admitted for 19 days. We will later remove these to create a more accurate linear regression model.

From this graph, we can see a few more outliers in infection percentage than in days hospital. There are outliers on both sides of the IQR, with a <2% probability of infection and three patients with an infection rate between ~7-8%.

Let’s look at histograms of the probability density of the infection rates. Essentially, we will see what probability of infection is most populous in our dataset.

Histograms

Here, we can see that 40% of our dataset contains patients with a 4-5% probability of being infected. This probability density follows a relatively normal distribution. ?mode() In this histogram, we can see that the most common value of days spent in the hospital is 8-9 days, as it has 25% of the density of the probability distribution. Now, we will plot our data on a scatter plot to see how the data aligns with one another.

Scatter Plot

Generally speaking, it appears that as the number of days hospitalized increases, the estimated probability of infection rate increases. We will now begin to develop our model to see how strong of a correlation there is between the two variables.

Diagnostics

Now, we will begin making our linear model and take note of our test statistics and confidence intervals. First, we will begin by removing our outliers using the method previously used in class.

Now, we will look at a scatter plot with the new dataset “new.hospital”, which is our previous dataset with outliers removed.

Immediately, the graph looks much better! The bounds are adjusted to better fit the data, so now the data looks much more correlated and closely grouped within the graph’s window. Now, let’s build a model with our updated data.

Now that we have created a linear model with our regularized datasets, we can begin to look at some of the advanced analytics behind our model to decide if it displays a significant linear relationship between days hospitalized and estimated probability of infection.

Analysis

Let’s look at some of the basic statistics behind our model, and evaluate how they compare based on the expected results for a fitting linear model.

## 
## Call:
## lm(formula = newInfect ~ newDays, data = new.hospital)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51493 -0.43588  0.08944  0.57779  2.17822 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.09273    0.74135   1.474    0.144    
## newDays      0.34265    0.07846   4.367 3.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9723 on 90 degrees of freedom
## Multiple R-squared:  0.1749, Adjusted R-squared:  0.1657 
## F-statistic: 19.07 on 1 and 90 DF,  p-value: 3.356e-05

##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 1.0927322 0.74134994 1.473976 1.439773e-01
## newDays     0.3426528 0.07845918 4.367276 3.356447e-05

## [1] 0.9723276

##     2.5 %    97.5 % 
## 0.1867800 0.4985257

Model Information

Our model is y = 0.34625X + 1.09273, with B0 = 1.09273 and B1 = 0.34625. Our practical interpretation of this model is that for every day spent hospitalized, a patient’s probability of infection goes up by 0.346%. Upon entering the hospital, every patient begins with a 1.09% chance of infection (as denoted by our intercept value).

Our standard error is 0.972, while s{b1} = 0.0784. Given B1 and s{b1}, our test statistic, t = B1/ s{b1}, is 4.4367.

Our p-value, Pr(>|t|) in the table, is 3.356 * 10^-05. This is great! It means that our model will hold up under even the most rigorous confidence intervals, as it is far lower than any alpha value we may choose.

Predicted Values and Hypotheses

We will use the same hypotheses for each of the three predictions.

H0: There is not a significant positive linear relationship between days hospitalized and estimated probability of infection. H1: There is a significant positive linear relationship between days hospitalized and estimated probability of infection.

1

Predict the infection risk for a patient who stayed 8 days.

Here, we will use a prediction interval to provide an accurate estimation for a single value Y* at x = 8.

## [1] 3.833955

The prediction for a patient that stays 8 days is that they have a 3.834% risk of infection.

days8 <- data.frame(newDays=8)
predict(lin.model, newdata = days8, interval = 'prediction', level = 0.95)

##        fit     lwr     upr
## 1 3.833955 1.88025 5.78766

Based on the variables lwr and upr provided by the predict function, we can say that we are 95% confident the prediction for a patient who is hospitalized for 8 days has a risk of infection rate between 1.88% and 5.79%.

2

Predict the average infection risk for a patients who stay 15 days.

Here, we can use the Y_bar estimated model to predict the mean average response at x = 15 days. We will use a confidence interval to estimate Ybar, the average value of Y at some X = x.

## [1] 6.232524

We get a predicted value of 6.232% risk of infection. To obtain the average infection risk, we will create a confidence interval for Ybar using the predict() function.

##        fit      lwr      upr
## 1 6.232525 5.330645 7.134404

For a patient who stays 15 days, the predicted average infection risk is 6.286%. Based on the variables lwr and upr provided by the predict function, we can say that we are 95% confident that the average infection risk for patients who stay for 15 days is between 5.33% and 7.13%.

3

Predict the infection risk for a patient who stayed 40 days.

Here, we will use a prediction interval to provide an accurate estimation for a single value Y* at x = 40.

## [1] 14.79884

The prediction for a patient that stays 40 days is that they have a 14.799% risk of infection.

##        fit      lwr      upr
## 1 14.79885 9.643121 19.95457

Based on the variables lwr and upr provided by the predict function, we can say that we are 95% confident the prediction for a patient who is hospitalized for 40 days has a risk of infection rate between 9.643% and 19.954%.

Because our p-value for the model, 0.0003356, is much lower than the alpha of a = 0.05, we can successfully reject the null hypothesis and declare that there is a significant positive linear relationship between days hospitalized and risk of infection rate.

Interpretation

In terms of our first predicted interval for x = 8, I am not surprised at the wide range provided by the prediction interval upon looking at the scatterplot. At x = 8 days, there is quite a large range of values. As prediction intervals for a single value are typically much larger than an average interval, this also explains why the range is much wider than our confidence interval for question 2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   2.900   3.700   3.748   4.400   6.300

As there is plenty of data points around this range, I have faith that our estimate closely fits reality.

At x = 15 days, we venture a little bit further along from our sample data (or at least our trimmed data). With our model built without outliers, there are no data points for x = 15 days, so there is not as much to compare our estimation to.

With a confidence interval between 5.33% and 7.13 and estimated value of 6.286%, I feel confident in our calculation From our sample outliers, we had a patient that spent 13.59 days hospitalized with an estimated risk of infection of 6.1% and another patient that spent 19.56 days hospitalized with an estimated risk of infection of 6.5%. Our estimate value fits very snuggly in between those two data points, and our interval provides a wide range for possible values within that area.

At x = 40 days, our model becomes somewhat more unreliable in its estimation as there are no data points in the sample remotely close to this estimation. As a result, our prediction interval is very wide – spanning over 10% – from 9.643% to 19.954%. Although our p-value for the model supports the statistical validity, this is the largest and most unfounded estimation of the three.

Conclusion

From our statistical analysis, we can conclude that there is a significant positive linear relationship between days hospitalized and the estimated probability of a patient becoming infected.

We have proved this as a result of our calculated linear model having a p-value far below an alpha value of 0.05.

While our confidence interval for x = 15 days has a very nicely fitted small bounds, our prediction intervals for x = 8 days and x = 40 days are much larger. This is largely because of the extra factoring of error required, but is also anecdotally noticeable for x = 8’s large spread in values and x = 40’s distance from the rest of the sample data.

Code Appendix

knitr::opts_chunk$set(echo = TRUE)
hospital <- read.csv("C:/Users/barle/Downloads/E_am_1_export/Exam 1 R/hospital.csv")
#Hospital dataset
head(hospital)
Days = hospital$Days
Infect = hospital$Infect
summary(hospital)
sdDays <- sd(Days)
#Standard Deviation of Days
sdDays
sdInfect <- sd(Infect)
#Standard Deviation of Risk of Infection Rate
sdInfect

boxplot(Days, main = "Frequency",
        xlab = "Days Hospitalized", horizontal = TRUE, col = "orange")


boxplot(Infect, main = "Frequency",
        xlab = "Probability of Infection (%)", horizontal = TRUE, col = "cyan")

hist(hospital$Infect, freq = FALSE, main = "Probability Density of Infection Rate",
     xlab = "Probability of Infection", col = "orange")
hist(hospital$Days, breaks = 15, freq = FALSE, main = "Probability Density of Days Spent in Hospital",
     xlab = "Days spent in hospital", col = "cyan")
plot(Days, Infect, xlab = "Days Hospitalized", ylab = "Estimated Probability of Infection (%)", main = "Relationship between
     Days Hospitalized and Estimated Probability of Infection", pch = 20,
     col = rgb(red = 0, green = 0, blue = 0, alpha = 0.7)) 

#Q3 + 1/5#(IQR)
#Q1 - 1.5*(IQR)

Q1.Days = quantile(Days, 0.25)
Q3.Days = quantile(Days, 0.75)

lower.Days = Q1.Days - 1.5*(Q3.Days - Q1.Days)
upper.Days = Q3.Days + 1.5*(Q3.Days - Q1.Days)

outlier.Days = which(Days > upper.Days | Days < lower.Days)

Q1.Infect = quantile(Infect, 0.25)
Q3.Infect = quantile(Infect, 0.75)

lower.Infect = Q1.Infect - 1.5*(Q3.Infect - Q1.Infect)
upper.Infect = Q3.Infect + 1.5*(Q3.Infect - Q1.Infect)

outlier.Infect = which(Infect > upper.Infect | Infect < lower.Infect)

new.hospital = hospital[-c(outlier.Days, outlier.Infect), ]

newDays = new.hospital$Days
newInfect = new.hospital$Infect

plot(newDays, newInfect, xlab = "Days Hospitalized", ylab = "Estimated Probability of Infection (%)", main = "Relationship between
     Days Hospitalized and Estimated Probability of Infection", pch = 20,
     col = rgb(red = 0, green = 0, blue = 0, alpha = 0.7))
lin.model = lm(newInfect ~ newDays, data=new.hospital)
plot(newDays, newInfect, xlab = "Days Hospitalized", ylab = "Estimated Probability of Infection (%)", main = "Relationship between
     Days Hospitalized and Estimated Probability of Infection", pch = 20,
     col = rgb(red = 0, green = 0, blue = 0, alpha = 0.7))
abline(lin.model, col = "red")
summary(lin.model)
summary(lin.model)$coefficients


standard_error = summary(lin.model)$sigma
standard_error

CI = confint(lin.model, level = 0.95)
slope.CI = CI[2,]
slope.CI

Ybar = 0.3426528 * (8) + 1.0927322 
Ybar

days8 <- data.frame(newDays=8)
predict(lin.model, newdata = days8, interval = 'prediction', level = 0.95)
Ybar2 = 0.3426528 * (15) + 1.0927322 
Ybar2

days15 <- data.frame(newDays=15)
predict(lin.model, newdata = days15, interval = 'confidence', level = 0.95)
Ybar = 0.3426528 * (40) + 1.0927322 
Ybar

days40 <- data.frame(newDays=40)
predict(lin.model, newdata = days40, interval = 'prediction', level = 0.95)

eight_days <- subset(new.hospital, subset = ((Days >= 8) & (Days < 9)))
summary(eight_days$Infect)
boxplot(eight_days$Infect, main = "Frequency",
        xlab = "Probability of Infection (%)", horizontal = TRUE, col = "cyan")

Exam1R

Vincent Barletta

2022-10-17