Recipe 1: Simple Linear Regression

Nursing Homes

Jane Braun

RPI

February 19th Version 1.0

1. Data

Data Selection

The data was selected from the 100+ datasets and is part of Carnegie Mellon’s data library. The dataset is called “Nursing Home Data” and comes from Smith, Piland, and Fisher’s 1992 study of urban vs. rural nursing facilities.

x <- read.delim("~/Spring2015/Applied Regression/nursing_home.txt")

head(x)
##   NUM_BEDS MED_INPAT_DAYS TOT_PAT_DAYS PAT_CARE_REV NURS_SAL  EXP RURAL
## 1      244            128          385        23521     5230 5334     0
## 2       59            155          203         9160     2459  493     1
## 3      120            281          392        21900     6304 6115     0
## 4      120            291          419        22354     6590 6346     0
## 5      120            238          363        17421     5362 6225     0
## 6       65            180          234        10531     3622  449     1

Data Organization

In this dataset, there are 52 observations of 7 variables each. The variables are the number of beds in the home, annual medical in-patient days, total annual patient days, total annual patient care revenue, annual nursing salaries, total annual expenditures, and whether or not the home is rural.

All patient days are in hundreds, and all monetary values are in hundreds of dollars.

summary(x)
##     NUM_BEDS      MED_INPAT_DAYS   TOT_PAT_DAYS    PAT_CARE_REV  
##  Min.   : 25.00   Min.   : 48.0   Min.   : 83.0   Min.   : 2853  
##  1st Qu.: 62.00   1st Qu.:125.2   1st Qu.:198.0   1st Qu.: 8857  
##  Median : 88.00   Median :164.5   Median :279.0   Median :12384  
##  Mean   : 93.27   Mean   :183.9   Mean   :280.2   Mean   :14210  
##  3rd Qu.:120.00   3rd Qu.:229.0   3rd Qu.:363.8   3rd Qu.:18777  
##  Max.   :244.00   Max.   :514.0   Max.   :776.0   Max.   :36029  
##     NURS_SAL         EXP           RURAL       
##  Min.   :1288   Min.   : 137   Min.   :0.0000  
##  1st Qu.:2336   1st Qu.:1229   1st Qu.:0.0000  
##  Median :3696   Median :2378   Median :1.0000  
##  Mean   :3813   Mean   :2848   Mean   :0.6538  
##  3rd Qu.:4840   3rd Qu.:4444   3rd Qu.:1.0000  
##  Max.   :7489   Max.   :6442   Max.   :1.0000
str(x)
## 'data.frame':    52 obs. of  7 variables:
##  $ NUM_BEDS      : int  244 59 120 120 120 65 120 90 96 120 ...
##  $ MED_INPAT_DAYS: int  128 155 281 291 238 180 306 214 155 133 ...
##  $ TOT_PAT_DAYS  : int  385 203 392 419 363 234 372 305 169 188 ...
##  $ PAT_CARE_REV  : int  23521 9160 21900 22354 17421 10531 22147 14025 8812 11729 ...
##  $ NURS_SAL      : int  5230 2459 6304 6590 5362 3622 4406 4173 1955 3224 ...
##  $ EXP           : int  5334 493 6115 6346 6225 449 4998 966 1260 6442 ...
##  $ RURAL         : int  0 1 0 0 0 1 1 1 0 1 ...

Continuous variables

All of the variables except the binary RURAL indicator (rural/not rural) are treated as continuous variables, although R stores them as integers.
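As a small illustration of separating the binary indicator from the continuous variables, the sketch below rebuilds only the six rows printed by head(x) above (not the full 52-row file) and converts RURAL to a factor. The data frame name x6, the new column RURAL_F, and the coding of 1 as "rural" and 0 as "non-rural" are my assumptions and are not stated in the dataset.

```r
# First six rows exactly as printed by head(x); the real file has 52 rows.
x6 <- data.frame(
  NUM_BEDS       = c(244, 59, 120, 120, 120, 65),
  MED_INPAT_DAYS = c(128, 155, 281, 291, 238, 180),
  TOT_PAT_DAYS   = c(385, 203, 392, 419, 363, 234),
  PAT_CARE_REV   = c(23521, 9160, 21900, 22354, 17421, 10531),
  NURS_SAL       = c(5230, 2459, 6304, 6590, 5362, 3622),
  EXP            = c(5334, 493, 6115, 6346, 6225, 449),
  RURAL          = c(0, 1, 0, 0, 0, 1)
)

# Assumption: 1 = rural, 0 = non-rural (the labels are hypothetical).
x6$RURAL_F <- factor(x6$RURAL, levels = c(0, 1),
                     labels = c("non-rural", "rural"))

# Columns stored as numbers, and the number of homes in each group.
num_cols     <- names(x6)[sapply(x6, is.numeric)]
rural_counts <- table(x6$RURAL_F)
print(num_cols)
print(rural_counts)
```

Keeping the original 0/1 column alongside the factor leaves summary(x)-style output unchanged while giving plots and tables readable group labels.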

Response variables

The response variable that I am interested in is the annual nursing salary. I want to know whether variation in some of the other factors can predict the variation in annual nursing salary.

2. Simple Linear Model

Dependent and Independent Variable

As just stated, the dependent variable I am interested in is the annual nursing salary (in hundreds of dollars). The independent variable that I am interested in is total annual patient days (in hundreds).

Null Hypothesis

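My null hypothesis is that the variation in annual nursing salary at the nursing homes cannot be explained by total annual patient days, so that any apparent relationship is due to nothing other than randomization (that is, the slope of the regression line is zero).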

Linear Model

This linear model analyzes whether the variation in nursing salary can be explained by the variation in total annual patient days.

We use the “lm” function to calculate the linear model.

“NURS_SAL ~ TOT_PAT_DAYS” reads as: “Nursing Salary is explained by Total Patient Days”

attach(x)
model <- lm(NURS_SAL ~ TOT_PAT_DAYS)
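An alternative to attach() is to pass the data frame to lm through its data argument, which keeps the column names out of the search path. The sketch below is a minimal example under that approach. It uses only the six rows shown by head(x) above, so its coefficients will not match the full 52-row model, and the names x6, fit6, and pred300 are mine.

```r
# Two columns from the six rows printed by head(x); not the full dataset.
x6 <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392, 419, 363, 234),
  NURS_SAL     = c(5230, 2459, 6304, 6590, 5362, 3622)
)

# Same formula as above, but the columns are looked up in x6 directly.
fit6 <- lm(NURS_SAL ~ TOT_PAT_DAYS, data = x6)

# Intercept (b0) and slope (b1) of the fitted line.
b <- coef(fit6)
print(b)

# Fitted nursing salary at 300 (hundred) patient days: b0 + b1 * 300.
pred300 <- predict(fit6, newdata = data.frame(TOT_PAT_DAYS = 300))
print(pred300)
```

Because the data frame is named in the call, predict() can later be given new values of TOT_PAT_DAYS without any ambiguity about where the variable comes from.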

2. Plots

Scattergram

# x was already attached above, so its columns can be used directly
plot(TOT_PAT_DAYS, NURS_SAL, pch=21, cex=0.5, bg='green', main="Nursing Salary vs. Total Patient Days")

Regression Line

plot(TOT_PAT_DAYS,NURS_SAL, pch=21, cex=0.5, bg='green', main="Nursing Salary vs. Total Patient Days with Regression Line")
abline(model$coef, lwd=2)

par(mfrow=c(1,1))
model.res <- resid(model)
plot(fitted(model), model.res, main="Plot of Fitted Values vs. Residuals")
abline(0,0)
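Besides plotting the residuals, it can help to look up which observation sits farthest from the line. The model summary later in this report shows a minimum residual of -6034.0, much larger in magnitude than the maximum of 2530.8, which suggests one home is worth inspecting. The sketch below shows one way to find such a row. It runs on the six rows from head(x) only, so the row it flags is not necessarily the one behind that residual, and the names x6, fit6, res6, and worst are mine.

```r
# Two columns from the six rows printed by head(x); not the full dataset.
x6 <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392, 419, 363, 234),
  NURS_SAL     = c(5230, 2459, 6304, 6590, 5362, 3622)
)
fit6 <- lm(NURS_SAL ~ TOT_PAT_DAYS, data = x6)

# Residual = observed salary minus fitted salary.
res6 <- resid(fit6)

# With an intercept in the model, least-squares residuals sum to zero.
print(sum(res6))

# Index of the observation with the largest absolute residual.
worst <- which.max(abs(res6))
print(cbind(x6[worst, ], residual = res6[worst]))
```

On the full data the same two lines, applied to model and x, would return the home responsible for the most extreme residual.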

95% Confidence Intervals of the Regression Line, b0 and b1

model_pred <- predict(model, interval="confidence")


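plot(TOT_PAT_DAYS,NURS_SAL, pch=21, cex=0.5, bg='green', main="Nursing Salary vs. Total Patient Days with 95% Confidence Band")
abline(model)
# sort by x so each confidence limit is drawn left to right as one curve
ord <- order(TOT_PAT_DAYS)
lines(TOT_PAT_DAYS[ord], model_pred[ord, "lwr"], lty=2)
lines(TOT_PAT_DAYS[ord], model_pred[ord, "upr"], lty=2)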
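The heading above also names b0 and b1, whose intervals are not produced by predict(). A minimal sketch of how they could be obtained is confint(), shown here together with the same slope interval computed by hand as estimate ± t × standard error. It uses only the six rows from head(x), so the numbers differ from the full model, and the names x6, fit6, ci, and manual are mine.

```r
# Two columns from the six rows printed by head(x); not the full dataset.
x6 <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392, 419, 363, 234),
  NURS_SAL     = c(5230, 2459, 6304, 6590, 5362, 3622)
)
fit6 <- lm(NURS_SAL ~ TOT_PAT_DAYS, data = x6)

# 95% confidence intervals for the intercept (b0) and slope (b1).
ci <- confint(fit6, level = 0.95)
print(ci)

# The slope interval by hand: estimate +/- t critical value * std. error.
coefs  <- coef(summary(fit6))
est    <- coefs["TOT_PAT_DAYS", "Estimate"]
se     <- coefs["TOT_PAT_DAYS", "Std. Error"]
tcrit  <- qt(0.975, df = df.residual(fit6))
manual <- est + c(-1, 1) * tcrit * se
print(manual)
```

For the model in this report, the equivalent call would be confint(model).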

3. Summary of Results

summary(model)
## 
## Call:
## lm(formula = NURS_SAL ~ TOT_PAT_DAYS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6034.0  -786.8     3.4   779.0  2530.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1829.913    507.024   3.609  0.00071 ***
## TOT_PAT_DAYS    7.077      1.664   4.253 9.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1436 on 50 degrees of freedom
## Multiple R-squared:  0.2656, Adjusted R-squared:  0.2509 
## F-statistic: 18.09 on 1 and 50 DF,  p-value: 9.231e-05
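Several of the figures in this summary can be recomputed from one another, which is a useful check when reading the output. The sketch below works only from the rounded numbers printed above (estimate 7.077, standard error 1.664, 50 residual degrees of freedom, R-squared 0.2656, 52 observations), so its results agree with the summary only up to rounding. The variable names are mine.

```r
# Values copied from the summary(model) output above (already rounded).
est_b1 <- 7.077    # slope estimate for TOT_PAT_DAYS
se_b1  <- 1.664    # its standard error
df_res <- 50       # residual degrees of freedom
r2     <- 0.2656   # multiple R-squared
n      <- 52       # number of observations

# t value = estimate / standard error (about 4.253).
t_b1 <- est_b1 / se_b1

# With a single predictor, the F statistic is the squared t value (about 18.09).
f_stat <- t_b1^2

# Two-sided p-value for the slope from the t distribution.
p_b1 <- 2 * pt(-abs(t_b1), df = df_res)

# 95% confidence interval for the slope (roughly 3.73 to 10.42).
ci_b1 <- est_b1 + c(-1, 1) * qt(0.975, df = df_res) * se_b1

# Adjusted R-squared with one predictor (about 0.2509).
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)

print(c(t = t_b1, F = f_stat, p = p_b1, adj_r2 = adj_r2))
print(ci_b1)
```

Read in the units of the data, the slope says that each additional hundred patient days is associated with roughly 7.08 hundred dollars more in annual nursing salary, and the interval for the slope excludes zero.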

4. Interpretation of Results