Descriptive Statistics

mydata=read.csv("D:/week4.csv")
str(mydata)
## 'data.frame':    384 obs. of  8 variables:
##  $ Hospital.ID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Fiscal.Year         : Factor w/ 3 levels "Y11","Y12","Y13": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population.Serviced : int  25294 42186 23772 2085 67258 23752 53781 98763 47986 64634 ...
##  $ Relative.Value.Units: num  402704 638252 447030 43337 1579789 ...
##  $ Full.Time.Eqivalents: num  955 949 953 200 2162 ...
##  $ Expenditures        : num  1.15e+08 1.16e+08 1.20e+08 1.91e+07 2.46e+08 ...
##  $ Quality.Score       : num  0.668 0.581 0.523 0.927 0.955 ...
##  $ Efficiency.Score    : num  0.457 0.727 0.486 0.778 0.889 ...
head(mydata)
##   Hospital.ID Fiscal.Year Population.Serviced Relative.Value.Units
## 1           1         Y11               25294            402703.73
## 2           2         Y11               42186            638251.99
## 3           3         Y11               23772            447029.54
## 4           4         Y11                2085             43337.26
## 5           5         Y11               67258           1579789.36
## 6           6         Y11               23752            673036.55
##   Full.Time.Eqivalents Expenditures Quality.Score Efficiency.Score
## 1               954.91    114948144     0.6677083           0.4566
## 2               949.25    116423140     0.5812500           0.7272
## 3               952.51    119977702     0.5229167           0.4858
## 4               199.98     19056531     0.9270833           0.7783
## 5              2162.15    246166031     0.9552083           0.8895
## 6              1359.07    152125186     0.5552083           0.5784
names(mydata)
## [1] "Hospital.ID"          "Fiscal.Year"          "Population.Serviced" 
## [4] "Relative.Value.Units" "Full.Time.Eqivalents" "Expenditures"        
## [7] "Quality.Score"        "Efficiency.Score"
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 3.3.3
## ResourceSelection 0.3-2   2017-02-28
library(psych)
## Warning: package 'psych' was built under R version 3.3.3
describe(mydata)
##                      vars   n         mean           sd      median
## Hospital.ID             1 384        64.50        37.00       64.50
## Fiscal.Year*            2 384         2.00         0.82        2.00
## Population.Serviced     3 384     24713.82     22756.42    16466.00
## Relative.Value.Units    4 384    546857.77    680047.21   246701.71
## Full.Time.Eqivalents    5 384      1060.86      1336.29      483.07
## Expenditures            6 384 124765816.51 173435868.06 52796103.84
## Quality.Score           7 384         0.71         0.12        0.72
## Efficiency.Score        8 384         0.73         0.17        0.73
##                          trimmed         mad        min          max
## Hospital.ID                64.50       47.44       1.00 1.280000e+02
## Fiscal.Year*                2.00        1.48       1.00 3.000000e+00
## Population.Serviced     20647.21    13092.10    1218.00 1.198500e+05
## Relative.Value.Units   398680.18   237760.93   23218.01 3.574006e+06
## Full.Time.Eqivalents      750.11      401.38     116.29 7.518630e+03
## Expenditures         84600936.46 47444814.57 7839562.52 1.301994e+09
## Quality.Score               0.72        0.11       0.31 9.600000e-01
## Efficiency.Score            0.73        0.20       0.34 1.000000e+00
##                             range  skew kurtosis         se
## Hospital.ID          1.270000e+02  0.00    -1.21       1.89
## Fiscal.Year*         2.000000e+00  0.00    -1.51       0.04
## Population.Serviced  1.186320e+05  1.94     4.14    1161.28
## Relative.Value.Units 3.550788e+06  2.10     4.27   34703.51
## Full.Time.Eqivalents 7.402340e+03  2.55     6.94      68.19
## Expenditures         1.294154e+09  3.04    11.27 8850612.08
## Quality.Score        6.500000e-01 -0.54     0.37       0.01
## Efficiency.Score     6.600000e-01 -0.03    -0.99       0.01
kdepairs(mydata[,3:8])

Transform Data to Multivariate Normality

library(car)
## Warning: package 'car' was built under R version 3.3.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
library(MASS)
## Warning: package 'MASS' was built under R version 3.3.3
myt=powerTransform(cbind(mydata$Population.Serviced, mydata$Relative.Value.Units, mydata$Full.Time.Eqivalents, mydata$Expenditures, mydata$Quality.Score, mydata$Efficiency.Score)~1 )  #searches for the optimal power transformation to obtain multivariate normality

testTransform(myt, myt$lambda)  #no difference from multivariate normality with t p-value of 1.
##                                                     LRT df pval
## LR test, lambda = (0.25 0.27 -0.08 -0.02 2.03 0.44)   0  6    1
newdata=mydata  #build transformations by raising each variable to the appropriate power value
newdata[,3]=newdata[,3]^myt$lambda[1]
newdata[,4]=newdata[,4]^myt$lambda[2]
newdata[,5]=newdata[,5]^myt$lambda[3]
newdata[,6]=newdata[,6]^myt$lambda[4]
newdata[,7]=newdata[,7]^myt$lambda[5]
newdata[,8]=newdata[,8]^myt$lambda[6]


kdepairs(newdata[,3:8])  #looks a lot better...The distributions are more normal and meet the assumptions of multivariate normal.

Let’s look at correlations

colnames(newdata)=c("ID", "FY", "Pop", "RVU", "FTE","Exp", "Qual","Eff")
cor(newdata[,3:8])  #run correlations
##              Pop         RVU        FTE        Exp       Qual        Eff
## Pop   1.00000000  0.93241054 -0.8815480 -0.8841802  0.2386684 0.02016303
## RVU   0.93241054  1.00000000 -0.9629433 -0.9626597  0.3161178 0.06534048
## FTE  -0.88154798 -0.96294331  1.0000000  0.9866701 -0.2862752 0.17062222
## Exp  -0.88418019 -0.96265966  0.9866701  1.0000000 -0.2978836 0.14762309
## Qual  0.23866843  0.31611777 -0.2862752 -0.2978836  1.0000000 0.10690171
## Eff   0.02016303  0.06534048  0.1706222  0.1476231  0.1069017 1.00000000
gr <- colorRampPalette(c("#B52127", "white", "#2171B5")) #set a color gradient
cor.plot(newdata[,3:8], number=TRUE,stars=TRUE, upper=FALSE,gr=gr)  #plot the correlations

Let’s do a first regression

It seems a good model might be \(Exp = f(Pop+RVU+FTE)\). But are these three independent variables independent enough not to cause problems with estimation? Probably not. There is something called collinearity occurring. We see a statistically significant correlationa among the three that have large effect sizes (correlation coefficients), so regression coefficient estimates are likely to be unstable. We could analyze this post-hoc using something called the variance inflation factor (VIF). Large values of the VIF incidate problems with the model. https://statisticalhorizons.com/multicollinearity

Let’s build our model with all variables and check the VIF.

mylm = lm (Exp~Pop+RVU+FTE+Qual+Eff, data=newdata)

anova(mylm)
## Analysis of Variance Table
## 
## Response: Exp
##            Df   Sum Sq  Mean Sq    F value    Pr(>F)    
## Pop         1 0.065132 0.065132 13283.0475 < 2.2e-16 ***
## RVU         1 0.012190 0.012190  2486.0500 < 2.2e-16 ***
## FTE         1 0.003965 0.003965   808.6429 < 2.2e-16 ***
## Qual        1 0.000007 0.000007     1.3924    0.2387    
## Eff         1 0.000166 0.000166    33.7587 1.326e-08 ***
## Residuals 378 0.001853 0.000005                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mylm)  
## 
## Call:
## lm(formula = Exp ~ Pop + RVU + FTE + Qual + Eff, data = newdata)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0081597 -0.0012542 -0.0003187  0.0016227  0.0060182 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.6075615  0.0140907  43.118  < 2e-16 ***
## Pop          0.0004451  0.0001543   2.885  0.00414 ** 
## RVU         -0.0009126  0.0001287  -7.092 6.54e-12 ***
## FTE          0.1525039  0.0222891   6.842 3.16e-11 ***
## Qual        -0.0007188  0.0007639  -0.941  0.34734    
## Eff          0.0166986  0.0028740   5.810 1.33e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002214 on 378 degrees of freedom
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9775 
## F-statistic:  3323 on 5 and 378 DF,  p-value: < 2.2e-16
hist(mylm$residuals) #residuals look normally distributed.  That's what we want!

mean(mylm$residuals)  #near zero = unbiased
## [1] -1.787317e-19
vif(mylm)  #Whoaaaaaa, Nellie!  High VIFs
##        Pop        RVU        FTE       Qual        Eff 
##  11.243916 116.665179  78.788407   1.155269   5.640047
ncvTest(mylm)  #This tests for constant variance (homoskedasticity). Ho: Homoskedastic, Ha:  Hetero.  
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 5.159042    Df = 1     p = 0.02312565
plot(mylm)  #A series of diagnostic plots.

There are some competing forces at play! Our independent variables are interwined. One way to handle this is to eliminate variables. That’s fine. Another way to handle this is to form a new variable by making a linear combination of others that maximizes the variability captured. Keep this in mind. We will do this later. I will remove variables and also add in a few. ONe of the things we CAN do in regression in R is use Factors. R automatically codes these into dummy (dichotomous variables), leaving one level out. You cannot do this automatically in Excel. So I am going to use FY as a set of two dummy variables. Of course, when I re-do my model, I need to evaluate for the new power transformations. I tell you what. I will just \(Exp~f(RVU,FY)\).

newdata=mydata[,-c(1,3,5,7,8)]
colnames(newdata)=c("FY","RVU","Exp")
myt=powerTransform(cbind(newdata$RVU, newdata$Exp)~1 ) 
newdata[,2]=newdata[,2]^myt$lambda[1]
newdata[,3]=newdata[,3]^myt$lambda[2]


mylm=lm(Exp~RVU+FY, data=newdata)
anova(mylm)
## Analysis of Variance Table
## 
## Response: Exp
##            Df   Sum Sq  Mean Sq  F value    Pr(>F)    
## RVU         1 0.077486 0.077486 5963.244 < 2.2e-16 ***
## FY          2 0.001101 0.000551   42.372 < 2.2e-16 ***
## Residuals 380 0.004938 0.000013                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mylm)
## 
## Call:
## lm(formula = Exp ~ RVU + FY, data = newdata)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0118718 -0.0024445  0.0005778  0.0025761  0.0075117 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.153e-01  5.081e-03 101.414  < 2e-16 ***
## RVU         -2.616e-01  3.399e-03 -76.963  < 2e-16 ***
## FYY12        3.572e-05  4.507e-04   0.079    0.937    
## FYY13        3.611e-03  4.506e-04   8.013 1.38e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003605 on 380 degrees of freedom
## Multiple R-squared:  0.9409, Adjusted R-squared:  0.9404 
## F-statistic:  2016 on 3 and 380 DF,  p-value: < 2.2e-16
hist(mylm$residuals)  #looks ok...We could test normality, but the visual test is good enough

mean(mylm$residuals)  #unbiased
## [1] -3.004408e-19
vif(mylm)  #both near one, meaning that we have reasonably separated variables.
##      GVIF Df GVIF^(1/(2*Df))
## RVU 1.001  1         1.00050
## FY  1.001  2         1.00025
ncvTest(mylm)  #meets the null assumption that the data are homoskedastic
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 3.269758    Df = 1     p = 0.07056802
plot(mylm)

Interpretation

Our model is simple.

\(Exp^{-.11521}=.5153-.2616RVU^{.03179}+.00004FY12+.004FY13\)

We notice that this model captures 94% of the variability of the dependent variable. We also notice that only two of the coefficients are statistically significant. FY12 could be dropped. Further, we notice a small residual standard error.

If we were to forecast using this model, we would be forecasting \(Exp^{-.115}\) and would then need to convert this back to expenditures in dollars. Let’s say you want to forecast for a facility that has 1.5 \(RVU^{.03179}\) during FY11. We note that FY11 is not in the model but is captured by the intercept. Then

\(.5153 -.2616 \times 1.5 = .1228828\)

So expenditures would be forecasted as \(.1228828^{-1/.11521}\), nearly $80M.

(coef(mylm)[1]+coef(mylm)[2]*1.5)^(1/myt$lambda[2]) #79,977,373
## (Intercept) 
##    79977373