The data were selected from the 100+ datasets in Carnegie Mellon’s data library. The dataset is called “Nursing Home Data” and comes from Smith, Piland, and Fisher’s 1992 study of urban vs. rural nursing facilities.
x <- read.delim("~/Spring2015/Applied Regression/nursing_home.txt")
head(x)
## NUM_BEDS MED_INPAT_DAYS TOT_PAT_DAYS PAT_CARE_REV NURS_SAL EXP RURAL
## 1 244 128 385 23521 5230 5334 0
## 2 59 155 203 9160 2459 493 1
## 3 120 281 392 21900 6304 6115 0
## 4 120 291 419 22354 6590 6346 0
## 5 120 238 363 17421 5362 6225 0
## 6 65 180 234 10531 3622 449 1
In this dataset, there are 52 observations with 7 variables each. The data record the number of beds in the home, annual medical in-patient days, total annual patient days, total annual patient care revenue, annual nursing salaries, total annual expenditures, and whether or not the home was rural.
All patient days are in hundreds, and all monetary values are in hundreds of dollars.
summary(x)
## NUM_BEDS MED_INPAT_DAYS TOT_PAT_DAYS PAT_CARE_REV
## Min. : 25.00 Min. : 48.0 Min. : 83.0 Min. : 2853
## 1st Qu.: 62.00 1st Qu.:125.2 1st Qu.:198.0 1st Qu.: 8857
## Median : 88.00 Median :164.5 Median :279.0 Median :12384
## Mean : 93.27 Mean :183.9 Mean :280.2 Mean :14210
## 3rd Qu.:120.00 3rd Qu.:229.0 3rd Qu.:363.8 3rd Qu.:18777
## Max. :244.00 Max. :514.0 Max. :776.0 Max. :36029
## NURS_SAL EXP RURAL
## Min. :1288 Min. : 137 Min. :0.0000
## 1st Qu.:2336 1st Qu.:1229 1st Qu.:0.0000
## Median :3696 Median :2378 Median :1.0000
## Mean :3813 Mean :2848 Mean :0.6538
## 3rd Qu.:4840 3rd Qu.:4444 3rd Qu.:1.0000
## Max. :7489 Max. :6442 Max. :1.0000
str(x)
## 'data.frame': 52 obs. of 7 variables:
## $ NUM_BEDS : int 244 59 120 120 120 65 120 90 96 120 ...
## $ MED_INPAT_DAYS: int 128 155 281 291 238 180 306 214 155 133 ...
## $ TOT_PAT_DAYS : int 385 203 392 419 363 234 372 305 169 188 ...
## $ PAT_CARE_REV : int 23521 9160 21900 22354 17421 10531 22147 14025 8812 11729 ...
## $ NURS_SAL : int 5230 2459 6304 6590 5362 3622 4406 4173 1955 3224 ...
## $ EXP : int 5334 493 6115 6346 6225 449 4998 966 1260 6442 ...
## $ RURAL : int 0 1 0 0 0 1 1 1 0 1 ...
All of the variables except the binary rural/not rural indicator are treated as continuous variables.
The response (dependent) variable that I am interested in is the annual nursing salary (in hundreds of dollars), and how variation in other factors may be able to predict its variation.
The independent variable that I am interested in is total annual patient days.
My null hypothesis is that the variation in annual nursing salary at the nursing homes cannot be explained by anything other than randomization. In terms of the model, this means the slope on total annual patient days is zero.
This linear model analyzes whether the variation in nursing salary can be explained by the variation in total annual patient days.
We use the “lm” function to calculate the linear model.
“NURS_SAL ~ TOT_PAT_DAYS” reads as: “Nursing Salary is explained by Total Patient Days”
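As a hedged illustration of the formula syntax, the sketch below fits the same kind of model on only the six rows printed by head(x) above, so its coefficients will not match the full 52-observation fit. The data frame name `demo` and the use of the `data =` argument are choices made for this example, not the code used in this analysis.

```r
# Illustration only: the six rows shown by head(x), not the full dataset
demo <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392, 419, 363, 234),
  NURS_SAL     = c(5230, 2459, 6304, 6590, 5362, 3622)
)

# "response ~ predictor": NURS_SAL is modeled as a linear function of TOT_PAT_DAYS
demo_model <- lm(NURS_SAL ~ TOT_PAT_DAYS, data = demo)

# Two coefficients come back: the intercept and the slope on TOT_PAT_DAYS
coef(demo_model)
```

Passing the data frame through `data =` keeps the coefficient labelled `TOT_PAT_DAYS` rather than `x$TOT_PAT_DAYS`, which can make the summary output a little easier to read.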
# Sort by total patient days so that the interval lines drawn later connect left to right
x <- x[order(x$TOT_PAT_DAYS),]
model <- lm(x$NURS_SAL ~ x$TOT_PAT_DAYS)
# Attach once, after sorting, so that bare column names refer to the sorted data
attach(x)
plot(TOT_PAT_DAYS,NURS_SAL, pch=21, cex=1, bg='ivory4', main="Nursing Salary vs. Total Patient Days", xlab = "Total Annual Patient Days (in hundreds)", ylab = "Average Annual Nurse Salary (in hundreds of $)")
plot(TOT_PAT_DAYS,NURS_SAL, pch=21, cex=1, bg='ivory4', main="Nursing Salary vs. Total Patient Days with Regression Line", xlab = "Total Annual Patient Days (in hundreds)", ylab = "Average Annual Nurse Salary (in hundreds of $)")
abline(model$coef, lwd=2.5, col='dodgerblue4')
par(mfrow=c(1,1))
model.res <- resid(model)
plot(fitted(model), model.res, pch=21, cex=1, bg='ivory4',main="Plot of Fitted Values vs. Residuals (Not Standardized)", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')
The plot below shows the data graphed with total annual patient days as the independent variable and annual nursing salary as the dependent variable. It also displays the 95% confidence band for the regression line. At any given value of patient days, if the sampling were repeated many times, about 95% of intervals constructed this way would contain the true mean nursing salary.
model_pred <- predict(model, interval="confidence")
plot(TOT_PAT_DAYS,NURS_SAL, pch=21, cex=1, bg='ivory4', main="Nursing Salary vs. Total Patient Days with Regression Line", xlab = "Total Annual Patient Days (in hundreds)", ylab = "Average Annual Nurse Salary (in hundreds of $)")
abline(model, lwd=2.5, col='dodgerblue4')
lines(TOT_PAT_DAYS, model_pred[,2], lty=2, lwd=2.5, col='dodgerblue2')
lines(TOT_PAT_DAYS, model_pred[,3], lty=2, lwd=2.5, col='dodgerblue2')
The confidence band does fan out at both the smaller and larger values of patient days. The band is narrowest at the mean of total patient days and widens with distance from it. Most homes also fall in the middle range, so the model has more values to base its estimate on there. Therefore, there is more confidence in that range of data.
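The widening of the band can be checked against the standard formula for the half-width of a confidence interval for the mean response, t * s * sqrt(1/n + (x0 - xbar)^2 / Sxx). The sketch below is a minimal check on simulated data (not the nursing home data); the variable names and the simulated coefficients are assumptions made for illustration.

```r
set.seed(1)
# Simulated data for illustration only (not the nursing home data)
d <- data.frame(x = runif(40, 80, 780))
d$y <- 1800 + 7 * d$x + rnorm(40, sd = 1400)
fit <- lm(y ~ x, data = d)

# Evaluate the band at the smallest x, the mean of x, and the largest x
new_x <- data.frame(x = c(min(d$x), mean(d$x), max(d$x)))
band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
half_predict <- (band[, "upr"] - band[, "lwr"]) / 2

# The same half-widths from the formula t * s * sqrt(1/n + (x0 - xbar)^2 / Sxx)
n   <- nrow(d)
s   <- summary(fit)$sigma
sxx <- sum((d$x - mean(d$x))^2)
half_manual <- qt(0.975, df = n - 2) * s *
  sqrt(1 / n + (new_x$x - mean(d$x))^2 / sxx)

# The two columns agree, and the middle row (the mean of x) is the narrowest
cbind(x0 = new_x$x, half_predict, half_manual)
```

The (x0 - xbar)^2 term is zero at the mean of x and grows toward either extreme, which is why the band takes the fanned shape described above.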
summary(model)
##
## Call:
## lm(formula = x$NURS_SAL ~ x$TOT_PAT_DAYS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6034.0 -786.8 3.4 779.0 2530.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1829.913 507.024 3.609 0.00071 ***
## x$TOT_PAT_DAYS 7.077 1.664 4.253 9.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1436 on 50 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.2509
## F-statistic: 18.09 on 1 and 50 DF, p-value: 9.231e-05
As seen in the summary of the model, the estimate of the y-intercept is 1,829.913, and the slope of the model is 7.077. Because salaries are recorded in hundreds of dollars, the intercept says that for an annual patient days value of 0, the predicted annual nursing salary is 1,829.913 hundred dollars, or about $182,991. This is an extrapolation, since the smallest observed value of total patient days is 83 (in hundreds of days). The slope of 7.077 says that for each additional unit of total patient days (one hundred days), the predicted annual nursing salary increases by 7.077 units, where one unit is a hundred dollars (about $708).
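As a worked example of reading the coefficients, the sketch below plugs one value of total patient days into the fitted line. The coefficients are copied from the summary output above; the choice of 280 hundred patient days (close to the sample mean of 280.2) and the variable names are assumptions made for illustration.

```r
# Coefficients copied from the summary(model) output above
intercept <- 1829.913   # hundreds of dollars
slope     <- 7.077      # hundreds of dollars per hundred patient days

# Predicted annual nursing salary at 280 hundred patient days
pred_hundreds <- intercept + slope * 280
pred_hundreds        # 3811.473, in hundreds of dollars
pred_hundreds * 100  # 381147.3, in dollars
```

The result is close to the mean of NURS_SAL reported by summary(x), 3813, which is expected because a least-squares line passes through the point of means.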
The R^2 value for the model is 0.2656. Generally, a larger R^2 value indicates a stronger relationship between the independent and dependent variables, where an R^2 of 1 would indicate a perfect correlation, and a 0 would indicate zero correlation. With this R^2 value of 0.2656, we can say that variation in the independent variable, total patient days, can explain 26.56% of the variation in the dependent variable, annual nurse salary. This is not a particularly high R^2 value, and therefore a large percent of the variation in the dependent variable cannot be explained by the independent variable.
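To make the "proportion of variation explained" reading concrete, the sketch below computes R^2 by hand as 1 - SS_residual / SS_total and compares it with the squared correlation, which is the same quantity in a one-predictor model. It uses only the six rows printed by head(x), so the number it produces is not the 0.2656 of the full model; the object names are assumptions made for illustration.

```r
# Illustration only: the six rows shown by head(x), not the full dataset
demo <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392, 419, 363, 234),
  NURS_SAL     = c(5230, 2459, 6304, 6590, 5362, 3622)
)
fit <- lm(NURS_SAL ~ TOT_PAT_DAYS, data = demo)

# Variation left unexplained by the line, and total variation around the mean
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((demo$NURS_SAL - mean(demo$NURS_SAL))^2)

r2_manual <- 1 - ss_res / ss_tot
r2_cor    <- cor(demo$TOT_PAT_DAYS, demo$NURS_SAL)^2

# All three values agree
c(manual = r2_manual, cor_squared = r2_cor, lm = summary(fit)$r.squared)
```

By the same identity, the R^2 of 0.2656 reported for the full model corresponds to a correlation of about 0.515 between total patient days and nursing salary.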
Another indicator of the strength of the evidence for the model is the p-value, which in this case is 9.231 x 10^-5. This is the probability of observing an F-statistic at least as large as the one obtained (in this case 18.09) if the null hypothesis were true, so a result like this would be very unlikely under randomization alone. If we analyze using an alpha of 0.05, the p-value is less than 0.05. Therefore, we reject our null hypothesis, which states that the variation in nursing salary cannot be explained by anything other than randomization. Instead, we suggest an alternative: that variation in total patient days helps to explain the variation in annual nursing salary.
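The reported p-value can be reproduced from the test statistics in the summary output. The sketch below is a minimal check using the rounded values printed above (F = 18.09 on 1 and 50 degrees of freedom, t = 4.253), so the results agree with 9.231e-05 only up to rounding; the variable names are assumptions made for illustration.

```r
# Test statistics copied (rounded) from the summary(model) output above
f_stat <- 18.09
t_stat <- 4.253

# Upper-tail probability of the F distribution with 1 and 50 degrees of freedom
p_from_f <- pf(f_stat, df1 = 1, df2 = 50, lower.tail = FALSE)

# With a single predictor, the F-statistic is the square of the slope's t-statistic,
# and the two-sided t-test gives the same p-value
t_squared <- t_stat^2
p_from_t  <- 2 * pt(t_stat, df = 50, lower.tail = FALSE)

c(p_from_f = p_from_f, p_from_t = p_from_t, t_squared = t_squared)

# Decision at alpha = 0.05
p_from_f < 0.05
```

Because the model has only one predictor, the overall F-test and the t-test on the slope are the same test, which is why the two p-values in the summary output match.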