Rregression Assigment 4

Problem 1:Abrasion data

In an industrial experiment 30 rubber tires were tested for abrasion loss. The input variables were hardness of the rubber in Shore units and tensile strength in kg/sq m. The output variable loss measured the abrasion loss in kg/hr. Fit a suitable model. Discuss model adequacy using suitable graphical methods as well as diagnostic tests. Predict the absrasion loss for a new tire with hard=70 and tens=170. Give the 95% interval. Repeat the prediction with hard=100 and tens=250. Comment on assumptions and validity of predictions.A snippet of the data is shown below and the full dataset is available in the file abrasionLoss.txt.

Firstly, fit the full model and use the dignostic plot to analyse the adequancy of the model.

## 
## Call:
## lm(formula = loss ~ hard + tens, data = abrasion)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.385 -14.608   3.816  19.755  65.981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 885.1611    61.7516  14.334 3.84e-14 ***
## hard         -6.5708     0.5832 -11.267 1.03e-11 ***
## tens         -1.3743     0.1943  -7.073 1.32e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.49 on 27 degrees of freedom
## Multiple R-squared:  0.8402, Adjusted R-squared:  0.8284 
## F-statistic:    71 on 2 and 27 DF,  p-value: 1.767e-11

## Analysis of Variance Table
## 
## Response: loss
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## hard       1 122455  122455  91.970 3.458e-10 ***
## tens       1  66607   66607  50.025 1.325e-07 ***
## Residuals 27  35950    1331                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although there are three outiler points(14,19,28), the model is good due to the scatters in the RF and SL are randomly distributed.

Next use the diagnostic check(both boxcox and jarque-bera)

## Loading required package: MASS
## Loading required package: tseries

## 
##  Jarque Bera Test
## 
## data:  resid(ans1)
## X-squared = 1.1266, df = 2, p-value = 0.5693

the output of the test suggest lamda is close to 1 and the residual of the model is normally distributed, so the model is good.

Finally we do the prediction:

##                  2.5 %       97.5 %
## (Intercept) 758.457323 1011.8648954
## hard         -7.767432   -5.3742274
## tens         -1.773001   -0.9756228

##      fit      lwr      upr
## 1 191.57 115.3423 267.7978

##         fit       lwr       upr
## 1 -115.4998 -207.2442 -23.75546

the prediction of hard=70 and tens=170 is good, but the second prediction doesn’t good because the output is negative.This result comes because the prediction is out of range.

problem 2:Hospital staffing data

The goal of this study reported in an unpublished technical report is to predict the staffing requirement for hospitals in a particular provincial jurisdiction with 17 hospitals. The data, summarized below, are available in the file staffing.txt. The output variable iswork and the rows indicate particular hospitals.

new - mean number of new patients per day xrays - number of xrays per month beds - mean number of beds in use pop - population of jurisdiction in 1000’s length - average duration of stay in hospital work - number of hours of nursing needed per 7 day week

Fit the full model and discuss its adequacy. Find the best subset model using stagewise regression and best subset regression. Discuss the adequacy of the fitted model.

## 
## Call:
## lm(formula = work ~ ., data = staffing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -611.93 -431.41  -70.77  332.60 1576.06 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1962.94816 1071.36170   1.832   0.0941 .
## new          -15.85167   97.65299  -0.162   0.8740  
## xrays          0.05593    0.02126   2.631   0.0234 *
## beds           1.58962    3.09208   0.514   0.6174  
## pop           -4.21867    7.17656  -0.588   0.5685  
## length      -394.31412  209.63954  -1.881   0.0867 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 642.1 on 11 degrees of freedom
## Multiple R-squared:  0.9908, Adjusted R-squared:  0.9867 
## F-statistic: 237.8 on 5 and 11 DF,  p-value: 8.068e-11

## Analysis of Variance Table
## 
## Response: work
##           Df    Sum Sq   Mean Sq   F value    Pr(>F)    
## new        1 480612694 480612694 1165.7505 1.629e-12 ***
## xrays      1   7231656   7231656   17.5407  0.001516 ** 
## beds       1    598469    598469    1.4516  0.253550    
## pop        1    276098    276098    0.6697  0.430531    
## length     1   1458572   1458572    3.5378  0.086703 .  
## Residuals 11   4535052    412277                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

now we use the AIC and BIC to select the best subset regression.

## Start:  AIC=224.4
## work ~ new + xrays + beds + pop + length
## 
##          Df Sum of Sq     RSS    AIC
## - new     1     10863 4545916 222.44
## - beds    1    108962 4644015 222.80
## - pop     1    142465 4677517 222.93
## <none>                4535052 224.40
## - length  1   1458572 5993624 227.14
## - xrays   1   2853834 7388887 230.70
## 
## Step:  AIC=222.44
## work ~ xrays + beds + pop + length
## 
##          Df Sum of Sq      RSS    AIC
## - pop     1    367483  4913399 221.76
## <none>                 4545916 222.44
## - length  1   2008920  6554835 226.66
## - xrays   1   2874412  7420328 228.77
## - beds    1  19070222 23616137 248.45
## 
## Step:  AIC=221.76
## work ~ xrays + beds + length
## 
##          Df Sum of Sq      RSS    AIC
## <none>                 4913399 221.76
## - length  1   1658984  6572383 224.71
## - xrays   1   2628688  7542086 227.05
## - beds    1  32726195 37639593 254.38

## 
## Call:
## lm(formula = work ~ xrays + beds + length, data = staffing)
## 
## Coefficients:
## (Intercept)        xrays         beds       length  
##  1523.38924      0.05299      0.97848   -320.95083

## 
## Call:
## lm(formula = work ~ xrays + beds + length, data = staffing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -687.40 -380.60  -25.03  281.91 1630.50 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1523.38924  786.89772   1.936   0.0749 .  
## xrays          0.05299    0.02009   2.637   0.0205 *  
## beds           0.97848    0.10515   9.305 4.12e-07 ***
## length      -320.95083  153.19222  -2.095   0.0563 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 614.8 on 13 degrees of freedom
## Multiple R-squared:  0.9901, Adjusted R-squared:  0.9878 
## F-statistic:   432 on 3 and 13 DF,  p-value: 2.894e-13

## Start:  AIC=229.4
## work ~ new + xrays + beds + pop + length
## 
##          Df Sum of Sq     RSS    AIC
## - new     1     10863 4545916 226.61
## - beds    1    108962 4644015 226.97
## - pop     1    142465 4677517 227.09
## <none>                4535052 229.40
## - length  1   1458572 5993624 231.31
## - xrays   1   2853834 7388887 234.87
## 
## Step:  AIC=226.61
## work ~ xrays + beds + pop + length
## 
##          Df Sum of Sq      RSS    AIC
## - pop     1    367483  4913399 225.09
## <none>                 4545916 226.61
## - length  1   2008920  6554835 230.00
## - xrays   1   2874412  7420328 232.10
## - beds    1  19070222 23616137 251.78
## 
## Step:  AIC=225.1
## work ~ xrays + beds + length
## 
##          Df Sum of Sq      RSS    AIC
## <none>                 4913399 225.09
## - length  1   1658984  6572383 227.21
## - xrays   1   2628688  7542086 229.55
## - beds    1  32726195 37639593 256.88

## 
## Call:
## lm(formula = work ~ xrays + beds + length, data = staffing)
## 
## Coefficients:
## (Intercept)        xrays         beds       length  
##  1523.38924      0.05299      0.97848   -320.95083

## 
## Call:
## lm(formula = work ~ xrays + beds + length, data = staffing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -687.40 -380.60  -25.03  281.91 1630.50 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1523.38924  786.89772   1.936   0.0749 .  
## xrays          0.05299    0.02009   2.637   0.0205 *  
## beds           0.97848    0.10515   9.305 4.12e-07 ***
## length      -320.95083  153.19222  -2.095   0.0563 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 614.8 on 13 degrees of freedom
## Multiple R-squared:  0.9901, Adjusted R-squared:  0.9878 
## F-statistic:   432 on 3 and 13 DF,  p-value: 2.894e-13

now, we use the best subset regression.

## Loading required package: bestglm
## Loading required package: leaps

## BIC
## BICq equivalent for q in (0.25804914597404, 0.680450993834172)
## Best Model:
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 1523.38923568 786.89772473  1.935943 7.492387e-02
## xrays          0.05298733   0.02009194  2.637243 2.050318e-02
## beds           0.97848162   0.10515362  9.305258 4.121293e-07
## length      -320.95082518 153.19222065 -2.095086 5.631250e-02

## Analysis of Variance Table
## 
## Response: work
##           Df    Sum Sq   Mean Sq   F value    Pr(>F)    
## xrays      1 441952483 441952483 1169.3296 4.041e-14 ***
## beds       1  46187675  46187675  122.2046 5.556e-08 ***
## length     1   1658984   1658984    4.3894   0.05631 .  
## Residuals 13   4913399    377954                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

this result also indicates the best subset regression is : work=1523.38923568+0.05298733xrays + 0.97848162lbeds-320.95082518*length and the model is good from the anova above, which suggests the variales is statistically significant.

##        fit      lwr      upr
## 1 11163.85 9719.929 12607.77

so the 95% prediction interval is (9719.929, 12607.77)

Assumption: The model is correct and the residual of the model is normally disributed.And the parameters are known.

problem 3: Weight loss

Thirteen pieces of iron we varying allows with varying iron (Fe) content were placed in seawater for2 months and the weight loss (loss) in mg per square decimeter per day was measured.

## 
##  Jarque Bera Test
## 
## data:  resid(ans3)
## X-squared = 1.1877, df = 2, p-value = 0.5522

using the Jarque-Bera test we know the residual of the polynomial regression model of degree 6 does follow the normal distribution due to the p-value is 0.552, so it is a statistically good model for prediction purposes.

now use the backward regression to choose the best polynomial regression

##     polynomial 1 polynomial 2 polynomial 3 polynomial 4 polynomial 5
## AIC     69.78062     71.42649     63.04931     61.75842     52.28024
## BIC     71.47547     73.68629     65.87406     65.14812     56.23488
##     polynomial 6
## AIC     51.61314
## BIC     56.13274

so the AIC suggest the model the polynomial regression of degree 6 is the best where the BIC give the more parsimonious model is also polynomial regression of degree 6.

Rregression Assigment 4

Tingbin Tan

Tuesday, November 25, 2014

Problem 1:Abrasion data

problem 2:Hospital staffing data

problem 3: Weight loss