The data was selected from the 100+ datasets that are part of Carnegie Mellon’s data library. The dataset is called “Nursing Home Data” and comes from Smith, Piland, and Fisher’s 1992 study of Urban vs Rural Nursing Facilities.
x <- read.delim("~/Spring2015/Applied Regression/nursing_home.txt")
head(x)
##   NUM_BEDS MED_INPAT_DAYS TOT_PAT_DAYS PAT_CARE_REV NURS_SAL  EXP RURAL
## 1      244            128          385        23521     5230 5334     0
## 2       59            155          203         9160     2459  493     1
## 3      120            281          392        21900     6304 6115     0
## 4      120            291          419        22354     6590 6346     0
## 5      120            238          363        17421     5362 6225     0
## 6       65            180          234        10531     3622  449     1
In this dataset, there are 52 observations of 7 variables. The data records the number of beds in the home, annual medical in-patient days, total annual patient days, total annual patient care revenue, annual nursing salaries, total annual expenditures, and whether or not the home is rural.
All patient days are in hundreds, and all monetary values (revenue, salaries, and expenditures) are in hundreds of dollars.
summary(x)
##     NUM_BEDS      MED_INPAT_DAYS   TOT_PAT_DAYS    PAT_CARE_REV
##  Min.   : 25.00   Min.   : 48.0   Min.   : 83.0   Min.   : 2853
##  1st Qu.: 62.00   1st Qu.:125.2   1st Qu.:198.0   1st Qu.: 8857
##  Median : 88.00   Median :164.5   Median :279.0   Median :12384
##  Mean   : 93.27   Mean   :183.9   Mean   :280.2   Mean   :14210
##  3rd Qu.:120.00   3rd Qu.:229.0   3rd Qu.:363.8   3rd Qu.:18777
##  Max.   :244.00   Max.   :514.0   Max.   :776.0   Max.   :36029
##     NURS_SAL         EXP           RURAL
##  Min.   :1288   Min.   : 137   Min.   :0.0000
##  1st Qu.:2336   1st Qu.:1229   1st Qu.:0.0000
##  Median :3696   Median :2378   Median :1.0000
##  Mean   :3813   Mean   :2848   Mean   :0.6538
##  3rd Qu.:4840   3rd Qu.:4444   3rd Qu.:1.0000
##  Max.   :7489   Max.   :6442   Max.   :1.0000
str(x)
## 'data.frame': 52 obs. of 7 variables:
## $ NUM_BEDS : int 244 59 120 120 120 65 120 90 96 120 ...
## $ MED_INPAT_DAYS: int 128 155 281 291 238 180 306 214 155 133 ...
## $ TOT_PAT_DAYS : int 385 203 392 419 363 234 372 305 169 188 ...
## $ PAT_CARE_REV : int 23521 9160 21900 22354 17421 10531 22147 14025 8812 11729 ...
## $ NURS_SAL : int 5230 2459 6304 6590 5362 3622 4406 4173 1955 3224 ...
## $ EXP : int 5334 493 6115 6346 6225 449 4998 966 1260 6442 ...
## $ RURAL : int 0 1 0 0 0 1 1 1 0 1 ...
All of the variables except the binary RURAL indicator are numeric and are treated here as continuous variables.
The response variable that I am interested in is the annual nursing salary, and I want to see whether variation in other factors can predict the variation in annual nursing salary.
The dependent variable is therefore the annual nursing salary (in hundreds of dollars). The independent variable that I am interested in is total annual patient days (in hundreds).
My null hypothesis is that total annual patient days explains none of the variation in annual nursing salary at the nursing homes, so that any apparent relationship is due to chance alone.
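For the simple linear model considered here, the null hypothesis can be written in terms of the slope. This is a sketch in standard regression notation; the symbols $\beta_0$, $\beta_1$, and $\varepsilon_i$ are my own notation and do not come from the dataset documentation.

```latex
\begin{align*}
\text{NURS\_SAL}_i &= \beta_0 + \beta_1 \,\text{TOT\_PAT\_DAYS}_i + \varepsilon_i \\
H_0 &: \beta_1 = 0 \\
H_a &: \beta_1 \neq 0
\end{align*}
```

Under $H_0$ the fitted line is flat, so knowing a home's total patient days tells us nothing about its nursing salaries.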
This linear model tests whether the variation in nursing salary can be explained by the variation in total annual patient days.
I use the “lm” function to fit the linear model.
“NURS_SAL ~ TOT_PAT_DAYS” reads as: “Nursing Salary is explained by Total Patient Days”.
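As a minimal sketch of how the formula interface works, the block below fits the same formula on a small data frame and passes it through the `data` argument of lm(). The names `first_six` and `sketch_model` are hypothetical, and the data frame holds only the six rows printed by head(x), so its coefficients will not match the fit on all 52 observations.

```r
# Hypothetical illustration: only the six rows shown by head(x),
# not the full 52-observation dataset.
first_six <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392, 419, 363, 234),
  NURS_SAL     = c(5230, 2459, 6304, 6590, 5362, 3622)
)

# In the formula, the response is on the left of "~" and the predictor on the right.
# data = first_six tells lm() which data frame holds both columns.
sketch_model <- lm(NURS_SAL ~ TOT_PAT_DAYS, data = first_six)

# Named vector holding the intercept and the slope for TOT_PAT_DAYS.
coef(sketch_model)
```

Passing `data` explicitly is one way to keep the model tied to a specific data frame, regardless of what else is on the search path.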
attach(x)  # make the columns of x available by name
model <- lm(NURS_SAL ~ TOT_PAT_DAYS)
plot(TOT_PAT_DAYS,NURS_SAL, pch=21, cex=0.5, bg='green', main="Nursing Salary vs. Total Patient Days")
plot(TOT_PAT_DAYS,NURS_SAL, pch=21, cex=0.5, bg='green', main="Nursing Salary vs. Total Patient Days with Regression Line")
abline(model, lwd=2)
par(mfrow=c(1,1))
model.res <- resid(model)
plot(fitted(model), model.res, xlab="Fitted values", ylab="Residuals", main="Plot of Fitted Values vs. Residuals")
abline(h=0)  # horizontal reference line at zero
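model_pred <- predict(model, interval="confidence")
ord <- order(TOT_PAT_DAYS)  # sort by x so the band lines are drawn left to right
plot(TOT_PAT_DAYS,NURS_SAL, pch=21, cex=0.5, bg='green', main="Nursing Salary vs. Total Patient Days with 95% Confidence Band")
abline(model)
lines(TOT_PAT_DAYS[ord], model_pred[ord,"lwr"], lty=2)  # lower confidence limit
lines(TOT_PAT_DAYS[ord], model_pred[ord,"upr"], lty=2)  # upper confidence limit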
summary(model)
##
## Call:
## lm(formula = NURS_SAL ~ TOT_PAT_DAYS)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6034.0  -786.8     3.4   779.0  2530.8
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1829.913    507.024   3.609  0.00071 ***
## TOT_PAT_DAYS    7.077      1.664   4.253 9.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1436 on 50 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.2509
## F-statistic: 18.09 on 1 and 50 DF, p-value: 9.231e-05
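To make the fitted coefficients concrete, the sketch below plugs a hypothetical home into the fitted line. The value of 300 (hundred) total patient days and the variable names are my own choices for illustration; the intercept and slope are copied from the summary output above.

```r
# Coefficients copied from the summary(model) output above.
intercept <- 1829.913
slope     <- 7.077

# Hypothetical home: 300 hundred (30,000) total annual patient days.
tot_pat_days <- 300

# Predicted annual nursing salaries, in hundreds of dollars.
predicted_nurs_sal <- intercept + slope * tot_pat_days
predicted_nurs_sal
## [1] 3953.013
```

That is about 3,953 hundred dollars, or roughly $395,300 in annual nursing salaries. With the fitted model object available, predict(model, newdata = data.frame(TOT_PAT_DAYS = 300)) should give the same value up to the rounding of the printed coefficients. Since the R-squared is only 0.2656, individual homes can be expected to fall well above or below this line.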