mydata=read.csv("D:/week4.csv")
str(mydata)
## 'data.frame': 384 obs. of 8 variables:
## $ Hospital.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Fiscal.Year : Factor w/ 3 levels "Y11","Y12","Y13": 1 1 1 1 1 1 1 1 1 1 ...
## $ Population.Serviced : int 25294 42186 23772 2085 67258 23752 53781 98763 47986 64634 ...
## $ Relative.Value.Units: num 402704 638252 447030 43337 1579789 ...
## $ Full.Time.Eqivalents: num 955 949 953 200 2162 ...
## $ Expenditures : num 1.15e+08 1.16e+08 1.20e+08 1.91e+07 2.46e+08 ...
## $ Quality.Score : num 0.668 0.581 0.523 0.927 0.955 ...
## $ Efficiency.Score : num 0.457 0.727 0.486 0.778 0.889 ...
head(mydata)
## Hospital.ID Fiscal.Year Population.Serviced Relative.Value.Units
## 1 1 Y11 25294 402703.73
## 2 2 Y11 42186 638251.99
## 3 3 Y11 23772 447029.54
## 4 4 Y11 2085 43337.26
## 5 5 Y11 67258 1579789.36
## 6 6 Y11 23752 673036.55
## Full.Time.Eqivalents Expenditures Quality.Score Efficiency.Score
## 1 954.91 114948144 0.6677083 0.4566
## 2 949.25 116423140 0.5812500 0.7272
## 3 952.51 119977702 0.5229167 0.4858
## 4 199.98 19056531 0.9270833 0.7783
## 5 2162.15 246166031 0.9552083 0.8895
## 6 1359.07 152125186 0.5552083 0.5784
names(mydata)
## [1] "Hospital.ID" "Fiscal.Year" "Population.Serviced"
## [4] "Relative.Value.Units" "Full.Time.Eqivalents" "Expenditures"
## [7] "Quality.Score" "Efficiency.Score"
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 3.3.3
## ResourceSelection 0.3-2 2017-02-28
library(psych)
## Warning: package 'psych' was built under R version 3.3.3
describe(mydata)
## vars n mean sd median
## Hospital.ID 1 384 64.50 37.00 64.50
## Fiscal.Year* 2 384 2.00 0.82 2.00
## Population.Serviced 3 384 24713.82 22756.42 16466.00
## Relative.Value.Units 4 384 546857.77 680047.21 246701.71
## Full.Time.Eqivalents 5 384 1060.86 1336.29 483.07
## Expenditures 6 384 124765816.51 173435868.06 52796103.84
## Quality.Score 7 384 0.71 0.12 0.72
## Efficiency.Score 8 384 0.73 0.17 0.73
## trimmed mad min max
## Hospital.ID 64.50 47.44 1.00 1.280000e+02
## Fiscal.Year* 2.00 1.48 1.00 3.000000e+00
## Population.Serviced 20647.21 13092.10 1218.00 1.198500e+05
## Relative.Value.Units 398680.18 237760.93 23218.01 3.574006e+06
## Full.Time.Eqivalents 750.11 401.38 116.29 7.518630e+03
## Expenditures 84600936.46 47444814.57 7839562.52 1.301994e+09
## Quality.Score 0.72 0.11 0.31 9.600000e-01
## Efficiency.Score 0.73 0.20 0.34 1.000000e+00
## range skew kurtosis se
## Hospital.ID 1.270000e+02 0.00 -1.21 1.89
## Fiscal.Year* 2.000000e+00 0.00 -1.51 0.04
## Population.Serviced 1.186320e+05 1.94 4.14 1161.28
## Relative.Value.Units 3.550788e+06 2.10 4.27 34703.51
## Full.Time.Eqivalents 7.402340e+03 2.55 6.94 68.19
## Expenditures 1.294154e+09 3.04 11.27 8850612.08
## Quality.Score 6.500000e-01 -0.54 0.37 0.01
## Efficiency.Score 6.600000e-01 -0.03 -0.99 0.01
kdepairs(mydata[,3:8]) #scatterplot matrix with kernel density estimates and correlations
library(car)
## Warning: package 'car' was built under R version 3.3.3
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
library(MASS)
## Warning: package 'MASS' was built under R version 3.3.3
myt=powerTransform(cbind(mydata$Population.Serviced, mydata$Relative.Value.Units, mydata$Full.Time.Eqivalents, mydata$Expenditures, mydata$Quality.Score, mydata$Efficiency.Score)~1) #searches for the optimal power transformations (one lambda per variable) to obtain multivariate normality
testTransform(myt, myt$lambda) #no detectable departure from multivariate normality: the LR test p-value is 1.
## LRT df pval
## LR test, lambda = (0.25 0.27 -0.08 -0.02 2.03 0.44) 0 6 1
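If you want more detail than this single likelihood-ratio test, summary() on a powerTransform object also reports each variable's estimated \(\lambda\) with a confidence interval, plus LR tests against \(\lambda=0\) (all logs) and \(\lambda=1\) (no transformation). A quick sketch:
summary(myt) #per-variable lambda estimates, Wald intervals, and LR tests vs. lambda = 0 and 1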
newdata=mydata #build transformations by raising each variable to the appropriate power value
newdata[,3]=newdata[,3]^myt$lambda[1]
newdata[,4]=newdata[,4]^myt$lambda[2]
newdata[,5]=newdata[,5]^myt$lambda[3]
newdata[,6]=newdata[,6]^myt$lambda[4]
newdata[,7]=newdata[,7]^myt$lambda[5]
newdata[,8]=newdata[,8]^myt$lambda[6]
kdepairs(newdata[,3:8]) #Looks a lot better. The distributions are closer to normal and now better satisfy the multivariate-normality assumption.
colnames(newdata)=c("ID", "FY", "Pop", "RVU", "FTE","Exp", "Qual","Eff")
cor(newdata[,3:8]) #run correlations
## Pop RVU FTE Exp Qual Eff
## Pop 1.00000000 0.93241054 -0.8815480 -0.8841802 0.2386684 0.02016303
## RVU 0.93241054 1.00000000 -0.9629433 -0.9626597 0.3161178 0.06534048
## FTE -0.88154798 -0.96294331 1.0000000 0.9866701 -0.2862752 0.17062222
## Exp -0.88418019 -0.96265966 0.9866701 1.0000000 -0.2978836 0.14762309
## Qual 0.23866843 0.31611777 -0.2862752 -0.2978836 1.0000000 0.10690171
## Eff 0.02016303 0.06534048 0.1706222 0.1476231 0.1069017 1.00000000
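One oddity in this matrix: FTE and Exp correlate negatively with Pop and RVU, even though larger hospitals surely have more staff and spending. That is an artifact of the negative lambdas (-0.08 and -0.02): raising a positive variable to a negative power reverses its ordering, which flips the signs of its correlations. car's bcPower() applies the scaled Box-Cox form \((U^\lambda-1)/\lambda\), a linear function of \(U^\lambda\) that preserves the original ordering for any \(\lambda\), so the correlation magnitudes are identical but the signs stay intuitive. A sketch, reusing the lambdas estimated above:
bcdata=mydata #sketch: same transformations via car's bcPower(), which keeps ordering even for negative lambdas
for (j in 3:8) bcdata[,j]=bcPower(bcdata[,j], myt$lambda[j-2])
cor(bcdata[,3:8]) #same magnitudes as above, but the FTE and Exp rows no longer flip sign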
gr <- colorRampPalette(c("#B52127", "white", "#2171B5")) #set a color gradient
cor.plot(newdata[,3:8], number=TRUE,stars=TRUE, upper=FALSE,gr=gr) #plot the correlations
It seems a good model might be \(Exp = f(Pop, RVU, FTE)\). But are these three independent variables independent enough not to cause problems with estimation? Probably not. This is called collinearity (or multicollinearity): we see statistically significant correlations among the three predictors, with large effect sizes (correlation coefficients), so the regression coefficient estimates are likely to be unstable. We can diagnose this post-hoc using the variance inflation factor (VIF); large VIF values indicate problems with the model. https://statisticalhorizons.com/multicollinearity
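To make that concrete: for predictor \(j\), \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on all the other predictors. A VIF of 10 therefore means 90% of that predictor's variance is reproduced by the others; common rules of thumb start worrying around 5 or 10.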
Let’s build our model with all variables and check the VIF.
mylm = lm(Exp~Pop+RVU+FTE+Qual+Eff, data=newdata)
anova(mylm)
## Analysis of Variance Table
##
## Response: Exp
## Df Sum Sq Mean Sq F value Pr(>F)
## Pop 1 0.065132 0.065132 13283.0475 < 2.2e-16 ***
## RVU 1 0.012190 0.012190 2486.0500 < 2.2e-16 ***
## FTE 1 0.003965 0.003965 808.6429 < 2.2e-16 ***
## Qual 1 0.000007 0.000007 1.3924 0.2387
## Eff 1 0.000166 0.000166 33.7587 1.326e-08 ***
## Residuals 378 0.001853 0.000005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mylm)
##
## Call:
## lm(formula = Exp ~ Pop + RVU + FTE + Qual + Eff, data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0081597 -0.0012542 -0.0003187 0.0016227 0.0060182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6075615 0.0140907 43.118 < 2e-16 ***
## Pop 0.0004451 0.0001543 2.885 0.00414 **
## RVU -0.0009126 0.0001287 -7.092 6.54e-12 ***
## FTE 0.1525039 0.0222891 6.842 3.16e-11 ***
## Qual -0.0007188 0.0007639 -0.941 0.34734
## Eff 0.0166986 0.0028740 5.810 1.33e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002214 on 378 degrees of freedom
## Multiple R-squared: 0.9778, Adjusted R-squared: 0.9775
## F-statistic: 3323 on 5 and 378 DF, p-value: < 2.2e-16
hist(mylm$residuals) #residuals look normally distributed. That's what we want!
mean(mylm$residuals) #near zero = unbiased
## [1] -1.787317e-19
vif(mylm) #Whoaaaaaa, Nellie! High VIFs
## Pop RVU FTE Qual Eff
## 11.243916 116.665179 78.788407 1.155269 5.640047
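As a check on our understanding, vif()'s numbers can be reproduced by hand from the formula given earlier. A sketch for RVU (aux is just an illustrative name):
aux=lm(RVU~Pop+FTE+Qual+Eff, data=newdata) #auxiliary regression of RVU on the other predictors
1/(1-summary(aux)$r.squared) #1/(1 - R^2) should reproduce the 116.67 reported by vif()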
ncvTest(mylm) #This tests for constant variance (homoskedasticity). Ho: Homoskedastic, Ha: Hetero. Here p = 0.023, so we reject the null at the .05 level: another sign of trouble with this model.
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5.159042 Df = 1 p = 0.02312565
plot(mylm) #A series of diagnostic plots.
There are some competing forces at play! Our independent variables are intertwined. One way to handle this is to eliminate variables. That's fine. Another way is to form a new variable as a linear combination of the others that maximizes the variability captured. Keep this in mind; we will do this later. For now I will remove variables and also add a new one in. One of the things we CAN do in regression in R is use factors. R automatically codes these into dummy (dichotomous) variables, leaving one level out, as the sketch below shows. You cannot do this automatically in Excel. So I am going to use FY as a set of two dummy variables. Of course, when I re-do my model, I need to re-estimate the power transformations. I tell you what: I will just model \(Exp = f(RVU, FY)\).
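To see exactly what R does with the factor, model.matrix() exposes the dummy coding that lm() will use: the first level, Y11, is dropped as the reference category and absorbed into the intercept. A sketch on the original data frame:
levels(mydata$Fiscal.Year) #the three levels: Y11, Y12, Y13
head(model.matrix(~Fiscal.Year, data=mydata)) #intercept plus Fiscal.YearY12 and Fiscal.YearY13 dummy columns; Y11 is the reference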
newdata=mydata[,-c(1,3,5,7,8)]
colnames(newdata)=c("FY","RVU","Exp")
myt=powerTransform(cbind(newdata$RVU, newdata$Exp)~1) #re-estimate the power transformations for just RVU and Exp
newdata[,2]=newdata[,2]^myt$lambda[1]
newdata[,3]=newdata[,3]^myt$lambda[2]
mylm=lm(Exp~RVU+FY, data=newdata)
anova(mylm)
## Analysis of Variance Table
##
## Response: Exp
## Df Sum Sq Mean Sq F value Pr(>F)
## RVU 1 0.077486 0.077486 5963.244 < 2.2e-16 ***
## FY 2 0.001101 0.000551 42.372 < 2.2e-16 ***
## Residuals 380 0.004938 0.000013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mylm)
##
## Call:
## lm(formula = Exp ~ RVU + FY, data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0118718 -0.0024445 0.0005778 0.0025761 0.0075117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.153e-01 5.081e-03 101.414 < 2e-16 ***
## RVU -2.616e-01 3.399e-03 -76.963 < 2e-16 ***
## FYY12 3.572e-05 4.507e-04 0.079 0.937
## FYY13 3.611e-03 4.506e-04 8.013 1.38e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003605 on 380 degrees of freedom
## Multiple R-squared: 0.9409, Adjusted R-squared: 0.9404
## F-statistic: 2016 on 3 and 380 DF, p-value: < 2.2e-16
hist(mylm$residuals) #looks ok...We could test normality formally (see the sketch below), but the visual check is good enough
mean(mylm$residuals) #unbiased
## [1] -3.004408e-19
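If we did want that formal test, the Shapiro-Wilk test is a standard choice (Ho: the residuals are normally distributed). A sketch:
shapiro.test(mylm$residuals) #a large p-value would agree with the visual check on the histogram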
vif(mylm) #both near one, meaning that we have reasonably separated variables.
## GVIF Df GVIF^(1/(2*Df))
## RVU 1.001 1 1.00050
## FY 1.001 2 1.00025
ncvTest(mylm) #p = 0.07, so we fail to reject the null that the data are homoskedastic
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 3.269758 Df = 1 p = 0.07056802
plot(mylm) #the diagnostic plots again, now for the new model
Our model is simple:
\(Exp^{-0.11521} = 0.5153 - 0.2616\,RVU^{0.03179} + 0.00004\,FY12 + 0.004\,FY13\)
We notice that this model captures 94% of the variability of the dependent variable. We also notice that only two of the three predictor coefficients (RVU and FY13) are statistically significant; FY12 could be dropped. Further, we notice a small residual standard error.
If we were to forecast using this model, we would be forecasting \(Exp^{-0.11521}\) and would then need to convert this back to expenditures in dollars. Let's say you want to forecast for a facility with a transformed value of \(RVU^{0.03179} = 1.5\) during FY11. Note that FY11 does not appear in the model; it is the reference level captured by the intercept. Then
\(0.5153 - 0.2616 \times 1.5 = 0.1228828\)
So expenditures would be forecasted as \(0.1228828^{-1/0.11521}\), nearly $80M.
(coef(mylm)[1]+coef(mylm)[2]*1.5)^(1/myt$lambda[2]) #79,977,373
## (Intercept)
## 79977373
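The same number falls out of predict(), which generalizes more gracefully once several predictors are in play. A sketch (new.obs is an illustrative name; RVU is on the transformed scale, and FY must use the factor's existing levels):
new.obs=data.frame(RVU=1.5, FY=factor("Y11", levels=levels(newdata$FY))) #one new observation on the transformed scale
predict(mylm, newdata=new.obs)^(1/myt$lambda[2]) #back-transform to dollars; should match the ~$80M above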