## Days Infect
## 1 7.13 4.1
## 2 8.82 1.6
## 3 8.34 2.7
## 4 8.95 5.6
## 5 11.20 5.7
## 6 9.76 5.1
Today, we will be exploring a hospital dataset collected as part of the SENIC project. In this data exploration, we will attempt to build a model to predict a patient’s risk of infection based on the number of days they are hospitalized, based on a varying amount of days.
This is a potentially very important question to answer as it can help hospitals optimize treatment processes based on a patient’s proximity to the critical period of potential infection.
In order to explore our data, we will be using a linear regression model with hypothesis testing to confirm the veracity of our results.
## Days Infect
## Min. : 6.700 Min. :1.300
## 1st Qu.: 8.355 1st Qu.:3.700
## Median : 9.350 Median :4.400
## Mean : 9.572 Mean :4.383
## 3rd Qu.:10.445 3rd Qu.:5.150
## Max. :19.560 Max. :7.800
## [1] 1.737769
## [1] 1.29055
Here, we can see that the average number of days a patient spends in the hospital is 9.572 days while the average probability of a patient becoming infected is 4.383%.
The respective minimums are 6.7 days and 1.3%, while the maximums are 19.56 days and 7.8%. The median is 9.35 days spent at the hospital with a 4.4% probability of infection.
In addition, we can see that the standard deviation of Days is 1.73 days, while the standard deviation of Probability of Infection is 1.29%.
Now, lets explore the boxplots of each variable to see how these distributions are developed visually.
From the boxplot of Days Hospitalized, we can see that there are only
two outliers: one patient admitted for 13 days and one patient admitted
for 19 days. We will later remove these to create a more accurate linear
regression model.
From this graph, we can see a few more outliers in infection percentage
than in days hospital. There are outliers on both sides of the IQR, with
a <2% probability of infection and three patients with an infection
rate between ~7-8%.
Let’s look at histograms of the probability density of the infection rates. Essentially, we will see what probability of infection is most populous in our dataset.
Here, we can see that 40% of our dataset contains patients with a 4-5%
probability of being infected. This probability density follows a
relatively normal distribution. ?mode()
In this histogram, we can see that the most common value of days spent
in the hospital is 8-9 days, as it has 25% of the density of the
probability distribution. Now, we will plot our data on a scatter plot
to see how the data aligns with one another.
Generally speaking, it appears that as the number of days hospitalized
increases, the estimated probability of infection rate increases. We
will now begin to develop our model to see how strong of a correlation
there is between the two variables.
Now, we will begin making our linear model and take note of our test
statistics and confidence intervals. First, we will begin by removing
our outliers using the method previously used in class.
Now, we will look at a scatter plot with the new dataset “new.hospital”, which is our previous dataset with outliers removed.
Immediately, the graph looks much better! The bounds are adjusted to
better fit the data, so now the data looks much more correlated and
closely grouped within the graph’s window. Now, let’s build a model with
our updated data.
Now that we have created a linear model with our regularized datasets,
we can begin to look at some of the advanced analytics behind our model
to decide if it displays a significant linear relationship between days
hospitalized and estimated probability of infection.
Let’s look at some of the basic statistics behind our model, and evaluate how they compare based on the expected results for a fitting linear model.
##
## Call:
## lm(formula = newInfect ~ newDays, data = new.hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.51493 -0.43588 0.08944 0.57779 2.17822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.09273 0.74135 1.474 0.144
## newDays 0.34265 0.07846 4.367 3.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9723 on 90 degrees of freedom
## Multiple R-squared: 0.1749, Adjusted R-squared: 0.1657
## F-statistic: 19.07 on 1 and 90 DF, p-value: 3.356e-05
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0927322 0.74134994 1.473976 1.439773e-01
## newDays 0.3426528 0.07845918 4.367276 3.356447e-05
## [1] 0.9723276
## 2.5 % 97.5 %
## 0.1867800 0.4985257
Our model is y = 0.34625X + 1.09273, with B0 = 1.09273 and B1 = 0.34625. Our practical interpretation of this model is that for every day spent hospitalized, a patient’s probability of infection goes up by 0.346%. Upon entering the hospital, every patient begins with a 1.09% chance of infection (as denoted by our intercept value).
Our standard error is 0.972, while s{b1} = 0.0784. Given B1 and s{b1}, our test statistic, t = B1/ s{b1}, is 4.4367.
Our p-value, Pr(>|t|) in the table, is 3.356 * 10^-05. This is great! It means that our model will hold up under even the most rigorous confidence intervals, as it is far lower than any alpha value we may choose.
We will use the same hypotheses for each of the three predictions.
H0: There is not a significant positive linear relationship between days hospitalized and estimated probability of infection. H1: There is a significant positive linear relationship between days hospitalized and estimated probability of infection.
Here, we will use a prediction interval to provide an accurate estimation for a single value Y* at x = 8.
## [1] 3.833955
The prediction for a patient that stays 8 days is that they have a 3.834% risk of infection.
days8 <- data.frame(newDays=8)
predict(lin.model, newdata = days8, interval = 'prediction', level = 0.95)
## fit lwr upr
## 1 3.833955 1.88025 5.78766
Based on the variables lwr and upr provided by the predict function, we can say that we are 95% confident the prediction for a patient who is hospitalized for 8 days has a risk of infection rate between 1.88% and 5.79%.
Here, we can use the Y_bar estimated model to predict the mean average response at x = 15 days. We will use a confidence interval to estimate Ybar, the average value of Y at some X = x.
## [1] 6.232524
We get a predicted value of 6.232% risk of infection. To obtain the average infection risk, we will create a confidence interval for Ybar using the predict() function.
## fit lwr upr
## 1 6.232525 5.330645 7.134404
For a patient who stays 15 days, the predicted average infection risk is 6.286%. Based on the variables lwr and upr provided by the predict function, we can say that we are 95% confident that the average infection risk for patients who stay for 15 days is between 5.33% and 7.13%.
Here, we will use a prediction interval to provide an accurate estimation for a single value Y* at x = 40.
## [1] 14.79884
The prediction for a patient that stays 40 days is that they have a 14.799% risk of infection.
## fit lwr upr
## 1 14.79885 9.643121 19.95457
Based on the variables lwr and upr provided by the predict function, we can say that we are 95% confident the prediction for a patient who is hospitalized for 40 days has a risk of infection rate between 9.643% and 19.954%.
Because our p-value for the model, 0.0003356, is much lower than the alpha of a = 0.05, we can successfully reject the null hypothesis and declare that there is a significant positive linear relationship between days hospitalized and risk of infection rate.
In terms of our first predicted interval for x = 8, I am not surprised at the wide range provided by the prediction interval upon looking at the scatterplot. At x = 8 days, there is quite a large range of values. As prediction intervals for a single value are typically much larger than an average interval, this also explains why the range is much wider than our confidence interval for question 2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 2.900 3.700 3.748 4.400 6.300
As there is plenty of data points around this range, I have faith that
our estimate closely fits reality.
At x = 15 days, we venture a little bit further along from our sample data (or at least our trimmed data). With our model built without outliers, there are no data points for x = 15 days, so there is not as much to compare our estimation to.
With a confidence interval between 5.33% and 7.13 and estimated value of 6.286%, I feel confident in our calculation From our sample outliers, we had a patient that spent 13.59 days hospitalized with an estimated risk of infection of 6.1% and another patient that spent 19.56 days hospitalized with an estimated risk of infection of 6.5%. Our estimate value fits very snuggly in between those two data points, and our interval provides a wide range for possible values within that area.
At x = 40 days, our model becomes somewhat more unreliable in its estimation as there are no data points in the sample remotely close to this estimation. As a result, our prediction interval is very wide – spanning over 10% – from 9.643% to 19.954%. Although our p-value for the model supports the statistical validity, this is the largest and most unfounded estimation of the three.
From our statistical analysis, we can conclude that there is a significant positive linear relationship between days hospitalized and the estimated probability of a patient becoming infected.
We have proved this as a result of our calculated linear model having a p-value far below an alpha value of 0.05.
While our confidence interval for x = 15 days has a very nicely fitted small bounds, our prediction intervals for x = 8 days and x = 40 days are much larger. This is largely because of the extra factoring of error required, but is also anecdotally noticeable for x = 8’s large spread in values and x = 40’s distance from the rest of the sample data.
knitr::opts_chunk$set(echo = TRUE)
hospital <- read.csv("C:/Users/barle/Downloads/E_am_1_export/Exam 1 R/hospital.csv")
#Hospital dataset
head(hospital)
Days = hospital$Days
Infect = hospital$Infect
summary(hospital)
sdDays <- sd(Days)
#Standard Deviation of Days
sdDays
sdInfect <- sd(Infect)
#Standard Deviation of Risk of Infection Rate
sdInfect
boxplot(Days, main = "Frequency",
xlab = "Days Hospitalized", horizontal = TRUE, col = "orange")
boxplot(Infect, main = "Frequency",
xlab = "Probability of Infection (%)", horizontal = TRUE, col = "cyan")
hist(hospital$Infect, freq = FALSE, main = "Probability Density of Infection Rate",
xlab = "Probability of Infection", col = "orange")
hist(hospital$Days, breaks = 15, freq = FALSE, main = "Probability Density of Days Spent in Hospital",
xlab = "Days spent in hospital", col = "cyan")
plot(Days, Infect, xlab = "Days Hospitalized", ylab = "Estimated Probability of Infection (%)", main = "Relationship between
Days Hospitalized and Estimated Probability of Infection", pch = 20,
col = rgb(red = 0, green = 0, blue = 0, alpha = 0.7))
#Q3 + 1/5#(IQR)
#Q1 - 1.5*(IQR)
Q1.Days = quantile(Days, 0.25)
Q3.Days = quantile(Days, 0.75)
lower.Days = Q1.Days - 1.5*(Q3.Days - Q1.Days)
upper.Days = Q3.Days + 1.5*(Q3.Days - Q1.Days)
outlier.Days = which(Days > upper.Days | Days < lower.Days)
Q1.Infect = quantile(Infect, 0.25)
Q3.Infect = quantile(Infect, 0.75)
lower.Infect = Q1.Infect - 1.5*(Q3.Infect - Q1.Infect)
upper.Infect = Q3.Infect + 1.5*(Q3.Infect - Q1.Infect)
outlier.Infect = which(Infect > upper.Infect | Infect < lower.Infect)
new.hospital = hospital[-c(outlier.Days, outlier.Infect), ]
newDays = new.hospital$Days
newInfect = new.hospital$Infect
plot(newDays, newInfect, xlab = "Days Hospitalized", ylab = "Estimated Probability of Infection (%)", main = "Relationship between
Days Hospitalized and Estimated Probability of Infection", pch = 20,
col = rgb(red = 0, green = 0, blue = 0, alpha = 0.7))
lin.model = lm(newInfect ~ newDays, data=new.hospital)
plot(newDays, newInfect, xlab = "Days Hospitalized", ylab = "Estimated Probability of Infection (%)", main = "Relationship between
Days Hospitalized and Estimated Probability of Infection", pch = 20,
col = rgb(red = 0, green = 0, blue = 0, alpha = 0.7))
abline(lin.model, col = "red")
summary(lin.model)
summary(lin.model)$coefficients
standard_error = summary(lin.model)$sigma
standard_error
CI = confint(lin.model, level = 0.95)
slope.CI = CI[2,]
slope.CI
Ybar = 0.3426528 * (8) + 1.0927322
Ybar
days8 <- data.frame(newDays=8)
predict(lin.model, newdata = days8, interval = 'prediction', level = 0.95)
Ybar2 = 0.3426528 * (15) + 1.0927322
Ybar2
days15 <- data.frame(newDays=15)
predict(lin.model, newdata = days15, interval = 'confidence', level = 0.95)
Ybar = 0.3426528 * (40) + 1.0927322
Ybar
days40 <- data.frame(newDays=40)
predict(lin.model, newdata = days40, interval = 'prediction', level = 0.95)
eight_days <- subset(new.hospital, subset = ((Days >= 8) & (Days < 9)))
summary(eight_days$Infect)
boxplot(eight_days$Infect, main = "Frequency",
xlab = "Probability of Infection (%)", horizontal = TRUE, col = "cyan")