Recipe 1: Simple Linear Regression

Nursing Homes

Jane Braun

RPI

February 26th Version 2.0

1. Data

Data Selection.

The data was selected from the 100+ datasets and is part of Carnegie Mellon’s data library. The dataset is called “Nursing Home Data” and comes from Smith, Piland, and Fisher’s 1992 study of urban vs. rural nursing facilities.

x <- read.delim("~/Spring2015/Applied Regression/nursing_home.txt")

head(x)

##   NUM_BEDS MED_INPAT_DAYS TOT_PAT_DAYS PAT_CARE_REV NURS_SAL  EXP RURAL
## 1      244            128          385        23521     5230 5334     0
## 2       59            155          203         9160     2459  493     1
## 3      120            281          392        21900     6304 6115     0
## 4      120            291          419        22354     6590 6346     0
## 5      120            238          363        17421     5362 6225     0
## 6       65            180          234        10531     3622  449     1

Data Organization

In this dataset, there are 52 observations of 7 variables each. The variables are the number of beds in the home, annual medical in-patient days, total annual patient days, total annual patient care revenue, annual nursing salaries, total annual expenditures, and whether or not the home is rural.

All patient days are in hundreds of days, and all monetary values are in hundreds of dollars.
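As a quick illustration of these units, the sketch below converts the first three rows printed by head(x) into days and dollars. The data frame `first_rows` and the `_DAYS` / `_USD` column names are made up for this example and are not part of the original analysis.

```r
# Illustration only: the first three rows printed by head(x) above.
first_rows <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392),     # recorded in hundreds of days
  NURS_SAL     = c(5230, 2459, 6304)   # recorded in hundreds of dollars
)

# Hypothetical helper columns: convert the recorded units to days and dollars.
first_rows$TOT_PAT_DAYS_DAYS <- first_rows$TOT_PAT_DAYS * 100
first_rows$NURS_SAL_USD      <- first_rows$NURS_SAL * 100

print(first_rows)
```

Read this way, the first home reports 38,500 patient days and $523,000 in nursing salaries for the year.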

summary(x)

##     NUM_BEDS      MED_INPAT_DAYS   TOT_PAT_DAYS    PAT_CARE_REV  
##  Min.   : 25.00   Min.   : 48.0   Min.   : 83.0   Min.   : 2853  
##  1st Qu.: 62.00   1st Qu.:125.2   1st Qu.:198.0   1st Qu.: 8857  
##  Median : 88.00   Median :164.5   Median :279.0   Median :12384  
##  Mean   : 93.27   Mean   :183.9   Mean   :280.2   Mean   :14210  
##  3rd Qu.:120.00   3rd Qu.:229.0   3rd Qu.:363.8   3rd Qu.:18777  
##  Max.   :244.00   Max.   :514.0   Max.   :776.0   Max.   :36029  
##     NURS_SAL         EXP           RURAL       
##  Min.   :1288   Min.   : 137   Min.   :0.0000  
##  1st Qu.:2336   1st Qu.:1229   1st Qu.:0.0000  
##  Median :3696   Median :2378   Median :1.0000  
##  Mean   :3813   Mean   :2848   Mean   :0.6538  
##  3rd Qu.:4840   3rd Qu.:4444   3rd Qu.:1.0000  
##  Max.   :7489   Max.   :6442   Max.   :1.0000

str(x)

## 'data.frame':    52 obs. of  7 variables:
##  $ NUM_BEDS      : int  244 59 120 120 120 65 120 90 96 120 ...
##  $ MED_INPAT_DAYS: int  128 155 281 291 238 180 306 214 155 133 ...
##  $ TOT_PAT_DAYS  : int  385 203 392 419 363 234 372 305 169 188 ...
##  $ PAT_CARE_REV  : int  23521 9160 21900 22354 17421 10531 22147 14025 8812 11729 ...
##  $ NURS_SAL      : int  5230 2459 6304 6590 5362 3622 4406 4173 1955 3224 ...
##  $ EXP           : int  5334 493 6115 6346 6225 449 4998 966 1260 6442 ...
##  $ RURAL         : int  0 1 0 0 0 1 1 1 0 1 ...

Continuous variables

All of the variables except the binary RURAL indicator (rural / not rural) are treated as continuous variables, although R stores them as integers.
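Because RURAL is a 0/1 indicator, it can be handled as a factor rather than as a number. A minimal sketch, using the six RURAL values printed by head(x); the labels "non-rural" and "rural" are my own naming and assume that 1 codes a rural home.

```r
# Illustration only: the RURAL values from the six rows printed by head(x).
rural_head <- c(0, 1, 0, 0, 0, 1)

# Assumption: 1 = rural, 0 = non-rural (labels chosen for this example).
rural_factor <- factor(rural_head, levels = c(0, 1),
                       labels = c("non-rural", "rural"))
print(table(rural_factor))

# The mean of a 0/1 variable is the proportion of ones, so the reported mean
# of 0.6538 across 52 homes corresponds to about 34 homes coded 1.
n_coded_one <- round(0.6538 * 52)
print(n_coded_one)
```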

Response variables

The response variable I am interested in is the annual nursing salary. The goal is to see how well variation in other factors can predict the variation in annual nursing salary.

2. Simple Linear Model

Dependent and Independent Variable

As just stated, the dependent variable I am interested in is the annual nursing salary (in hundreds of dollars). The independent variable I am interested in is total annual patient days (in hundreds of days).

Null Hypothesis

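My null hypothesis is that the variation in annual nursing salary at the nursing homes cannot be explained by anything other than randomization. In terms of the model, this means total annual patient days has no linear relationship with annual nursing salary.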
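One common way to write this hypothesis formally, assuming the usual simple linear regression model (the symbols are standard textbook notation, not taken from the original report):

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
H_0 : \beta_1 = 0 \quad \text{versus} \quad H_1 : \beta_1 \neq 0
```

Here y_i is the annual nursing salary and x_i is the total annual patient days of home i.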

Linear Model

This linear model tests whether the variation in nursing salary can be explained by the variation in total annual patient days.

We use the “lm” function to calculate the linear model.

“NURS_SAL ~ TOT_PAT_DAYS” reads as: “Nursing Salary is explained by Total Patient Days”

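# Attach x so that its columns can be referenced by name
attach(x)
# Sort the rows by total patient days so that lines drawn later run left to right
x <- x[order(TOT_PAT_DAYS),]
# Regress annual nursing salary on total annual patient days
model <- lm(x$NURS_SAL ~ x$TOT_PAT_DAYS)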
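The same kind of model can also be written with the data argument of lm instead of x$ prefixes, which keeps the coefficient names short and lets predict accept new data. The sketch below is only an illustration on the six rows printed by head(x), so its coefficients will not match the full 52-home model; the names `d`, `fit`, and `new_home` are hypothetical.

```r
# Illustration only: the six rows printed by head(x), not the full dataset.
d <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392, 419, 363, 234),
  NURS_SAL     = c(5230, 2459, 6304, 6590, 5362, 3622)
)

# Formula uses bare column names; lm looks them up in `d`.
fit <- lm(NURS_SAL ~ TOT_PAT_DAYS, data = d)
print(names(coef(fit)))

# With this form, predict() can evaluate the line at values not in the data.
new_home <- data.frame(TOT_PAT_DAYS = 300)   # hypothetical home, 30,000 patient days
pred <- predict(fit, newdata = new_home)
print(pred)
```

This form also avoids attach, so the model does not depend on which copy of x happens to be on the search path.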

2. Plots

Scattergram

attach(x)

## The following objects are masked from x (pos = 3):
## 
##     EXP, MED_INPAT_DAYS, NUM_BEDS, NURS_SAL, PAT_CARE_REV, RURAL,
##     TOT_PAT_DAYS

# Scatterplot of annual nursing salary against total annual patient days
plot(TOT_PAT_DAYS, NURS_SAL, pch=21, cex=1, bg='ivory4',
     main="Nursing Salary vs. Total Patient Days",
     xlab="Total Annual Patient Days (in hundreds)",
     ylab="Annual Nursing Salary (in hundreds of $)")

Regression Line

# Same scatterplot, with the fitted regression line overlaid
plot(TOT_PAT_DAYS, NURS_SAL, pch=21, cex=1, bg='ivory4',
     main="Nursing Salary vs. Total Patient Days with Regression Line",
     xlab="Total Annual Patient Days (in hundreds)",
     ylab="Annual Nursing Salary (in hundreds of $)")
abline(model, lwd=2.5, col='dodgerblue4')

# Residuals against fitted values; the horizontal line marks a residual of zero
par(mfrow=c(1,1))
model.res <- resid(model)
plot(fitted(model), model.res, pch=21, cex=1, bg='ivory4',
     main="Plot of Fitted Values vs. Residuals (Not Standardized)",
     xlab="Fitted Values of Model", ylab="Residuals")
abline(h=0, lwd=2.5, col='dodgerblue4')

95% Confidence Intervals of the Regression Line, b0 and b1

The plot below shows the data with total patient days as the independent variable and annual nursing salary as the dependent variable. It also displays the 95% confidence interval for the regression line as two dashed curves. At each value of total patient days, the interval estimates the mean nursing salary: if the study were repeated on many samples, about 95% of the intervals built this way would contain the true mean.

model_pred <- predict(model, interval="confidence")


# Scatterplot with the regression line and the 95% confidence limits (dashed)
plot(TOT_PAT_DAYS, NURS_SAL, pch=21, cex=1, bg='ivory4',
     main="Nursing Salary vs. Total Patient Days with 95% Confidence Interval",
     xlab="Total Annual Patient Days (in hundreds)",
     ylab="Annual Nursing Salary (in hundreds of $)")
abline(model, lwd=2.5, col='dodgerblue4')
# Columns 2 and 3 of model_pred hold the lower and upper confidence limits
lines(TOT_PAT_DAYS, model_pred[,2], lty=2, lwd=2.5, col='dodgerblue2')
lines(TOT_PAT_DAYS, model_pred[,3], lty=2, lwd=2.5, col='dodgerblue2')

The confidence interval fans out at both the smaller and larger values of total patient days. This is expected: the interval for the fitted line is narrowest at the mean of the independent variable and widens as values move away from it. Most of the observations also fall in the middle range, so the model has more data to rely on there, and there is more confidence in that range.
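To see where the changing width comes from, the interval can be rebuilt by hand from the standard formula for the standard error of a fitted mean. The sketch below is an illustration on the six rows printed by head(x) only, not the full dataset, and the names `d`, `fit`, and `half_width` are hypothetical.

```r
# Illustration only: the six rows printed by head(x), not the full dataset.
d <- data.frame(
  TOT_PAT_DAYS = c(385, 203, 392, 419, 363, 234),
  NURS_SAL     = c(5230, 2459, 6304, 6590, 5362, 3622)
)
fit <- lm(NURS_SAL ~ TOT_PAT_DAYS, data = d)

n     <- nrow(d)
s     <- summary(fit)$sigma                      # residual standard error
xbar  <- mean(d$TOT_PAT_DAYS)
sxx   <- sum((d$TOT_PAT_DAYS - xbar)^2)          # spread of x around its mean
tcrit <- qt(0.975, df = n - 2)                   # two-sided 95% critical value

# Half-width of the interval for the mean response at each observed x.
# The (x - xbar)^2 term is what makes the band widen away from the mean.
half_width <- tcrit * s * sqrt(1 / n + (d$TOT_PAT_DAYS - xbar)^2 / sxx)

# Compare with the limits computed by predict().
ci <- predict(fit, interval = "confidence")
manual_lwr <- fitted(fit) - half_width
manual_upr <- fitted(fit) + half_width
print(cbind(ci, manual_lwr, manual_upr))
```

The hand-computed limits agree with the "lwr" and "upr" columns from predict, and the narrowest interval belongs to the observation closest to the mean of total patient days.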

3. Summary of Results

summary(model)

## 
## Call:
## lm(formula = x$NURS_SAL ~ x$TOT_PAT_DAYS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6034.0  -786.8     3.4   779.0  2530.8 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1829.913    507.024   3.609  0.00071 ***
## x$TOT_PAT_DAYS    7.077      1.664   4.253 9.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1436 on 50 degrees of freedom
## Multiple R-squared:  0.2656, Adjusted R-squared:  0.2509 
## F-statistic: 18.09 on 1 and 50 DF,  p-value: 9.231e-05
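
The 95% confidence intervals for b0 and b1 can be reconstructed from the estimates and standard errors printed above. This is a minimal sketch that retypes those printed values, so it is only as precise as the rounding in the summary; with the fitted object available, confint(model) would be the usual route.

```r
# Values copied from the coefficient table in summary(model).
est <- c(b0 = 1829.913, b1 = 7.077)
se  <- c(b0 = 507.024,  b1 = 1.664)
df_resid <- 50                          # residual degrees of freedom

# Two-sided 95% critical value of the t distribution.
tcrit <- qt(0.975, df = df_resid)

# Estimate plus or minus critical value times standard error.
ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
print(round(ci, 2))
```

The interval for the slope comes out at roughly 3.7 to 10.4 and does not include zero, which agrees with the small p-value reported for the slope.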

4. Interpretation of Results

As seen in the summary of the model, the estimate of the y-intercept is 1,829.913 and the slope is 7.077. Because salaries are recorded in hundreds of dollars, the intercept means that a home with 0 annual patient days would have a predicted annual nursing salary of about $182,991. This is an extrapolation, since the smallest observed value of total patient days is 83 (in hundreds of days). The slope says that each additional unit of total patient days (one hundred days) is associated with an increase of 7.077 units in annual nursing salary, where one unit is a hundred dollars, or about $708.
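As a worked example of reading the fitted line, the sketch below plugs the median of total patient days (279, from the summary of the data) into the reported coefficients. The variable names are hypothetical, and the coefficients are the rounded values printed in the model summary.

```r
# Coefficients copied from summary(model), in hundreds of dollars.
b0 <- 1829.913   # intercept
b1 <- 7.077      # slope per hundred patient days

# Median of TOT_PAT_DAYS from summary(x), in hundreds of days.
median_days <- 279

# Predicted annual nursing salary, first in hundreds of dollars, then in dollars.
pred_hundreds <- b0 + b1 * median_days
pred_dollars  <- pred_hundreds * 100

print(pred_hundreds)
print(pred_dollars)
```

For a home at the median of 27,900 patient days, the line predicts about 3,804 hundred dollars, or roughly $380,000 in annual nursing salary.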

The R^2 value for the model is 0.2656. Generally, a larger R^2 indicates a stronger linear relationship between the independent and dependent variables: an R^2 of 1 would mean the model explains all of the variation, and an R^2 of 0 would mean it explains none. With an R^2 of 0.2656, variation in the independent variable, total patient days, explains 26.56% of the variation in the dependent variable, annual nursing salary. This is not a particularly high value, so a large share of the variation in the dependent variable is not explained by the independent variable.
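In a regression with a single predictor, R^2, the F-statistic, and the slope's t value are tied together, which gives a quick consistency check on the printed summary. A small sketch using the rounded values from the output; the variable names are hypothetical.

```r
# Values copied from summary(model).
f_stat    <- 18.09
df_resid  <- 50
t_slope   <- 4.253
r_squared <- 0.2656

# With one predictor, R^2 = F / (F + residual df).
r2_from_f <- f_stat / (f_stat + df_resid)

# With one predictor, the squared t value of the slope equals F.
t_squared <- t_slope^2

# The correlation between the two variables is the square root of R^2
# (positive here because the slope is positive).
r <- sqrt(r_squared)

print(c(r2_from_f = r2_from_f, t_squared = t_squared, r = r))
```

Up to rounding, the three quantities agree with the summary, and the implied correlation between total patient days and nursing salary is about 0.52.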

Another indicator of the strength of the evidence is the p-value, which in this case is 9.231 x 10^-5. This is the probability of obtaining an F-statistic at least as large as the one observed (18.09) if the null hypothesis were true, that is, through randomization alone. Such a small value shows that this would be very unlikely. Using an alpha of 0.05, the p-value is less than alpha, so we reject our null hypothesis, which states that the variation in nursing salary cannot be explained by anything other than randomization. Instead, we suggest an alternative: that variation in total patient days helps to explain the variation in annual nursing salary.
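The p-value can be recovered from the F-statistic and its degrees of freedom as the upper tail of the F distribution. A minimal sketch using the rounded F value from the summary, so the result matches the printed p-value only approximately; the variable names are hypothetical.

```r
# Values copied from summary(model).
f_stat <- 18.09
df1    <- 1      # numerator degrees of freedom (one predictor)
df2    <- 50     # residual degrees of freedom

# Probability of an F value at least this large when the slope is truly zero.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(p_value)

# Decision at the 5% significance level.
alpha <- 0.05
print(p_value < alpha)
```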

5. Data Source

http://lib.stat.cmu.edu/DASL/Datafiles/nursinghomedat.html

Nursing Home and Nurse Salary

Jane Braun

February 26th, 2015
