library(readr)   # provides read_csv()
asphalt <- read_csv("/Users/shubhamchougule/Downloads/asphalt.csv")
## Parsed with column specification:
## cols(
## y = col_double(),
## visc = col_double(),
## `%surf` = col_double(),
## `%base` = col_double(),
## `%fines` = col_double(),
## `%voids` = col_double(),
## run = col_double()
## )
fit<-lm(y~.,data = asphalt)
summary(fit)
##
## Call:
## lm(formula = y ~ ., data = asphalt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6781 -1.8309 0.1751 1.4858 11.1262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.970450 36.118989 -1.743 0.0941 .
## visc 0.003071 0.008161 0.376 0.7100
## `%surf` 7.498028 3.967155 1.890 0.0709 .
## `%base` 6.225817 4.812723 1.294 0.2081
## `%fines` 0.522211 1.174673 0.445 0.6606
## `%voids` -0.241275 1.684963 -0.143 0.8873
## run -5.386297 0.985384 -5.466 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.94 on 24 degrees of freedom
## Multiple R-squared: 0.7274, Adjusted R-squared: 0.6592
## F-statistic: 10.67 on 6 and 24 DF, p-value: 8.588e-06
The summary of the lm fit gives us the estimated coefficients of the fitted model.
The fitted full regression model with all six predictors is
y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + e
y = -62.97 + (0.003)visc + (7.50)surf + (6.22)base + (0.522)fines - (0.241)voids - (5.38)run
The adjusted R-squared is 65.92%, a reasonably good value, and the overall p-value (8.588e-06) is well below 0.05.
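As a quick sanity check, the coefficients in the equation above can be read straight off the fit object; a minimal sketch (round() is only for display):

# extract the estimated coefficients from the fitted model
round(coef(fit), 4)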
mod0 <- lm(y ~ 1,data = asphalt)
anova(fit,mod0)
## Analysis of Variance Table
##
## Model 1: y ~ visc + `%surf` + `%base` + `%fines` + `%voids` + run
## Model 2: y ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 372.51
## 2 30 1366.52 -6 -994 10.674 8.588e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA test comparing the fitted model against the intercept-only model.
The overall p-value of the fit (8.588e-06) is less than 0.05 (alpha), so the full model is a significant improvement over the null model and we can continue with the analysis.
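The F statistic in this table can be reproduced by hand from the two residual sums of squares; a minimal sketch using the values printed above:

# F = ((RSS0 - RSS1)/df_diff) / (RSS1/df_res)
RSS1 <- 372.51; RSS0 <- 1366.52
Fstat <- ((RSS0 - RSS1) / 6) / (RSS1 / 24)
Fstat                                 # ~10.67, matching the table
pf(Fstat, 6, 24, lower.tail = FALSE)  # ~8.6e-06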
#The fitted equation for y when the run indicator is 1 and when it is -1.
#The general fitted line is
y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + e
y = -62.97 + (0.003)visc + (7.50)surf + (6.22)base + (0.522)fines - (0.241)voids - (5.38)run
#run = {1, -1}
#When run = 1, the run term folds into the intercept: y = (b0 + b6) + b1x1 + ... + b5x5 + e
y = -68.35 + (0.003)visc + (7.50)surf + (6.22)base + (0.522)fines - (0.241)voids
#When run = -1: y = (b0 - b6) + b1x1 + ... + b5x5 + e
y = -57.59 + (0.003)visc + (7.50)surf + (6.22)base + (0.522)fines - (0.241)voids
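The two intercepts can also be computed directly from the coefficient vector instead of by hand; a small sketch (fit is still the full model at this point):

b <- coef(fit)
b["(Intercept)"] + b["run"]   # intercept when run = 1  (about -68.36)
b["(Intercept)"] - b["run"]   # intercept when run = -1 (about -57.58)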
library(olsrr)   # provides ols_step_all_possible()
ols_step_all_possible(fit)
## Index N Predictors R-Square Adj. R-Square
## 6 1 1 run 0.65755826 0.6457499286
## 1 2 1 visc 0.31095511 0.2871949378
## 5 3 1 `%voids` 0.17970999 0.1514241240
## 4 4 1 `%fines` 0.07244760 0.0404630335
## 3 5 1 `%base` 0.04418469 0.0112255443
## 2 6 1 `%surf` 0.03260116 -0.0007574194
## 15 7 2 `%surf` run 0.69469939 0.6728922025
## 18 8 2 `%base` run 0.67054201 0.6470093012
## 21 9 2 `%voids` run 0.66837793 0.6446906378
## 20 10 2 `%fines` run 0.66422323 0.6402391724
## 11 11 2 visc run 0.65895777 0.6345976113
## 10 12 2 visc `%voids` 0.34222182 0.2952376594
## 9 13 2 visc `%fines` 0.33205132 0.2843406952
## 7 14 2 visc `%surf` 0.32275960 0.2743852908
## 8 15 2 visc `%base` 0.32205656 0.2736320274
## 19 16 2 `%fines` `%voids` 0.28241390 0.2311577462
## 17 17 2 `%base` `%voids` 0.21922390 0.1634541745
## 14 18 2 `%surf` `%voids` 0.17971528 0.1211235195
## 13 19 2 `%surf` `%fines` 0.12034404 0.0575114670
## 12 20 2 `%surf` `%base` 0.10409255 0.0400991655
## 16 21 2 `%base` `%fines` 0.08304511 0.0175483337
## 34 22 3 `%surf` `%base` run 0.72392146 0.6932460640
## 36 23 3 `%surf` `%fines` run 0.70650950 0.6738994421
## 37 24 3 `%surf` `%voids` run 0.69488278 0.6609808705
## 25 25 3 visc `%surf` run 0.69470235 0.6607803907
## 40 26 3 `%base` `%voids` run 0.68193515 0.6465946133
## 41 27 3 `%fines` `%voids` run 0.68035604 0.6448400402
## 39 28 3 `%base` `%fines` run 0.67189892 0.6354432470
## 28 29 3 visc `%base` run 0.67096103 0.6344011455
## 31 30 3 visc `%voids` run 0.66840244 0.6315582685
## 30 31 3 visc `%fines` run 0.66507010 0.6278556636
## 29 32 3 visc `%fines` `%voids` 0.38197598 0.3133066485
## 27 33 3 visc `%base` `%voids` 0.35641483 0.2849053620
## 23 34 3 visc `%surf` `%fines` 0.35039467 0.2782163012
## 24 35 3 visc `%surf` `%voids` 0.34361049 0.2706783259
## 22 36 3 visc `%surf` `%base` 0.34353968 0.2705996495
## 26 37 3 visc `%base` `%fines` 0.33455532 0.2606170192
## 38 38 3 `%base` `%fines` `%voids` 0.28642038 0.2071337507
## 35 39 3 `%surf` `%fines` `%voids` 0.28320788 0.2035643114
## 33 40 3 `%surf` `%base` `%voids` 0.22299454 0.1366606039
## 32 41 3 `%surf` `%base` `%fines` 0.14516806 0.0501867364
## 53 42 4 `%surf` `%base` `%fines` run 0.72576870 0.6835792648
## 44 43 4 visc `%surf` `%base` run 0.72515239 0.6828681403
## 54 44 4 `%surf` `%base` `%voids` run 0.72397377 0.6815081916
## 55 45 4 `%surf` `%fines` `%voids` run 0.70779239 0.6628373697
## 46 46 4 visc `%surf` `%fines` run 0.70667231 0.6615449762
## 47 47 4 visc `%surf` `%voids` run 0.69488635 0.6479457881
## 56 48 4 `%base` `%fines` `%voids` run 0.68625512 0.6379866818
## 50 49 4 visc `%base` `%voids` run 0.68214999 0.6332499858
## 51 50 4 visc `%fines` `%voids` run 0.68062905 0.6314950598
## 49 51 4 visc `%base` `%fines` run 0.67225422 0.6218317947
## 45 52 4 visc `%surf` `%fines` `%voids` 0.38454367 0.2898580768
## 48 53 4 visc `%base` `%fines` `%voids` 0.38370552 0.2888909813
## 43 54 4 visc `%surf` `%base` `%voids` 0.36211989 0.2639844838
## 42 55 4 visc `%surf` `%base` `%fines` 0.35797706 0.2592043041
## 52 56 4 `%surf` `%base` `%fines` `%voids` 0.28882879 0.1794178396
## 58 57 5 visc `%surf` `%base` `%fines` run 0.72716609 0.6725993100
## 62 58 5 `%surf` `%base` `%fines` `%voids` run 0.72579057 0.6709486885
## 59 59 5 visc `%surf` `%base` `%voids` run 0.72515420 0.6701850413
## 60 60 5 visc `%surf` `%fines` `%voids` run 0.70839139 0.6500696665
## 61 61 5 visc `%base` `%fines` `%voids` run 0.68682460 0.6241895189
## 57 62 5 visc `%surf` `%base` `%fines` `%voids` 0.38801966 0.2656235867
## 63 63 6 visc `%surf` `%base` `%fines` `%voids` run 0.72739899 0.6592487329
## Mallow's Cp
## 6 3.148830
## 1 33.664035
## 5 45.218955
## 4 54.662417
## 3 57.150705
## 2 58.170527
## 15 1.878897
## 18 4.005731
## 21 4.196259
## 20 4.562042
## 11 5.025617
## 10 32.911290
## 9 33.806709
## 7 34.624758
## 8 34.686655
## 19 38.176825
## 17 43.740120
## 14 47.218488
## 13 52.445578
## 12 53.876371
## 16 55.729404
## 34 1.306164
## 36 2.839126
## 37 3.862751
## 25 3.878637
## 40 5.002670
## 41 5.141697
## 39 5.886268
## 28 5.968840
## 31 6.194101
## 30 6.487483
## 29 31.411303
## 27 33.661727
## 23 34.191746
## 24 34.789030
## 22 34.795264
## 26 35.586255
## 38 39.824091
## 35 40.106922
## 33 45.408150
## 32 52.260052
## 53 3.143532
## 44 3.197792
## 54 3.301559
## 55 4.726180
## 46 4.824792
## 47 5.862437
## 56 6.622337
## 50 6.983756
## 51 7.117661
## 49 7.854987
## 45 33.185242
## 48 33.259034
## 43 35.159449
## 42 35.524186
## 52 41.612052
## 58 5.020504
## 62 5.141606
## 59 5.197633
## 60 6.673443
## 61 8.572200
## 57 34.879214
## 63 7.000000
fit<-lm(y~`%surf`+`%base`+run,data = asphalt)
summary(fit)
##
## Call:
## lm(formula = y ~ `%surf` + `%base` + run, data = asphalt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9414 -1.7181 -0.1026 1.4959 11.0532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.6005 28.0368 -2.233 0.0340 *
## `%surf` 7.4270 3.2506 2.285 0.0304 *
## `%base` 6.8282 4.0391 1.691 0.1024
## run -5.2685 0.6767 -7.786 2.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.738 on 27 degrees of freedom
## Multiple R-squared: 0.7239, Adjusted R-squared: 0.6932
## F-statistic: 23.6 on 3 and 27 DF, p-value: 1.044e-07
The function ols_step_all_possible fits every possible subset regression so that we can select the best model. We select on adjusted R-squared, since adjusted R-squared is appropriate for comparing models with different numbers of variables.
From the output, the subset {%surf, %base, run} has the highest adjusted R-squared (69.32%), so this is the best model.
We refit with these three predictors and summarise the new model, as shown above.
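Rather than scanning the table by eye, the winning subset can be picked programmatically. A sketch, assuming the olsrr result behaves like a data frame with predictors/adjr/cp columns, as in the olsrr version used here (newer versions keep the table under $result):

# refit the full model inline, since fit now holds the reduced model
all_fits <- ols_step_all_possible(lm(y ~ ., data = asphalt))
best <- all_fits[which.max(all_fits$adjr), ]
best[c("predictors", "adjr", "cp")]   # expect `%surf` `%base` run, adjr ~ 0.693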
#Checking the model assumptions with diagnostic plots
plot(fit)
Residuals vs Fitted: the line is not flat, so the model does not capture a purely linear relationship; the linearity assumption appears violated.
Normal Q-Q plot: large deviations from the reference line can be seen, so the residuals do not look normally distributed; the normality assumption is violated.
Scale-Location: the line is not flat and horizontal, so there is no sign of homoscedasticity; the constant-variance assumption is violated.
Residuals vs Leverage: no values fall in the upper or lower right-hand corners of the plot beyond the red (Cook's distance) bands, so there is no evidence of influential cases.
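All four panels can be shown on one page, which makes them easier to compare; a small convenience sketch:

par(mfrow = c(2, 2))   # 2x2 grid of diagnostic panels
plot(fit)
par(mfrow = c(1, 1))   # reset the plotting layout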
#Testing for outliers
library(car)   # provides outlierTest(), ncvTest(), durbinWatsonTest()
outlierTest(fit)
## rstudent unadjusted p-value Bonferroni p
## 13 4.351371 0.00018611 0.0057695
Hypotheses: H0 = the observation is NOT an outlier; HA = the observation is an outlier.
The outlierTest function runs a Bonferroni-adjusted outlier test on the most extreme studentized residual. Observation 13 is the most extreme observation in our data, and its Bonferroni p-value (0.0058) is below 0.05, so the null is rejected: observation 13 is an outlier.
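A common follow-up (not part of the original analysis) is to refit without the flagged observation and see how much the coefficients move; a hypothetical sketch:

# refit excluding observation 13 and compare the coefficient estimates
fit_no13 <- update(fit, subset = -13)
cbind(full = coef(fit), without_13 = coef(fit_no13))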
#Testing for Normality
# distribution of studentized residuals
library(MASS)   # provides studres()
sresid <- studres(fit)
hist(sresid, freq = FALSE,
     main = "Distribution of Studentized Residuals")
# overlay a standard normal density for comparison
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
The histogram does not match the overlaid normal curve: the bulk of the residuals sits to the left with a long right tail. Normality does not look good, so the normality assumption appears violated.
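A Q-Q plot of the studentized residuals tells the same story with confidence bands; a sketch using car's qqPlot (car was loaded above):

qqPlot(fit, main = "QQ Plot of Studentized Residuals")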
shapiro.test(fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit$residuals
## W = 0.93002, p-value = 0.04391
Hypotheses: H0 = errors are normally distributed; HA = errors are not normally distributed.
Here the p-value (0.044) is less than 0.05, so we reject H0 in favour of HA: the errors are not normally distributed.
#Testing the Homoscedasticity Assumption
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 13.25413, Df = 1, p = 0.00027198
p < 0.05, suggesting that the errors are not homoscedastic. The constant error variance assumption is violated.
#Testing the Independence Assumption
durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.1063792 2.159149 0.722
## Alternative hypothesis: rho != 0
Hypotheses: H0 = errors are uncorrelated; HA = errors are correlated.
p > 0.05, so we fail to reject H0: the errors are uncorrelated and the independence assumption holds.
byssinosis <- read_csv("/Users/shubhamchougule/Downloads/byssinosis.csv")
## Parsed with column specification:
## cols(
## BysYes = col_double(),
## BysNo = col_double(),
## Total = col_double(),
## Dust = col_double(),
## Race = col_double(),
## Sex = col_double(),
## Smoke = col_double(),
## Employ = col_double()
## )
model1<-glm(cbind(BysNo,Total-BysNo)~Dust + Race + Sex + Smoke + Employ,family = "binomial",data=byssinosis)
summary(model1)
##
## Call:
## glm(formula = cbind(BysNo, Total - BysNo) ~ Dust + Race + Sex +
## Smoke + Employ, family = "binomial", data = byssinosis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9804 -0.3688 0.2421 0.7573 3.4126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4852 0.6060 0.801 0.42331
## Dust 1.3751 0.1155 11.901 < 2e-16 ***
## Race -0.2463 0.2061 -1.195 0.23203
## Sex 0.2590 0.2116 1.224 0.22095
## Smoke 0.6292 0.1931 3.259 0.00112 **
## Employ -0.3856 0.1069 -3.607 0.00031 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 322.527 on 64 degrees of freedom
## Residual deviance: 69.509 on 59 degrees of freedom
## AIC: 188.19
##
## Number of Fisher Scoring iterations: 5
To fit the logistic regression we use the glm function with family = "binomial". From the summary, the residual deviance is 69.509 on 59 degrees of freedom.
The fitted logit equation is
y = 0.4852 + (1.3751)Dust - (0.2463)Race + (0.2590)Sex + (0.6292)Smoke - (0.3856)Employ
Among the predictors, Dust, Smoke, and Employ have a significant effect on byssinosis: their p-values are below 0.05, which makes these three predictors significant; Race and Sex are not significant.
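Since these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret; a minimal sketch:

# odds ratios (and profile-likelihood 95% CIs) for the full logistic model
exp(coef(model1))
exp(confint(model1))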
Building a model with only the significant predictors:
model<-glm(cbind(BysNo,Total-BysNo)~Dust + Smoke + Employ,family = "binomial",data=byssinosis)
summary(model)
##
## Call:
## glm(formula = cbind(BysNo, Total - BysNo) ~ Dust + Smoke + Employ,
## family = "binomial", data = byssinosis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0523 -0.4001 0.2518 0.7700 3.3421
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.14177 0.34120 0.415 0.677783
## Dust 1.46572 0.10578 13.856 < 2e-16 ***
## Smoke 0.67781 0.18871 3.592 0.000328 ***
## Employ -0.33313 0.08861 -3.760 0.000170 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 322.527 on 64 degrees of freedom
## Residual deviance: 72.562 on 61 degrees of freedom
## AIC: 187.24
##
## Number of Fisher Scoring iterations: 5
The fitted equation with only the significant predictors is
y = 0.142 + (1.466)Dust + (0.678)Smoke - (0.333)Employ
#Check the adequacy of main model
deviance(model1)
## [1] 69.50926
# Gives the p value for the goodness of fit test
pchisq(model1$deviance, df=model1$df.residual, lower.tail=FALSE)
## [1] 0.1645594
Hypotheses: H0 = the model fits the data well; HA = the model does not fit the data well.
The chi-squared test statistic of 69.509 on 59 degrees of freedom gives a p-value of 0.1646, indicating that the null hypothesis is plausible; we conclude that the logistic model is adequate.
#Check the adequacy of significant model
deviance(model)
## [1] 72.56192
# Gives the p value for the goodness of fit test
pchisq(model$deviance, df=model$df.residual, lower.tail=FALSE)
## [1] 0.1476404
Hypotheses: H0 = the model fits the data well; HA = the model does not fit the data well.
The chi-squared test statistic of 72.562 on 61 degrees of freedom gives a p-value of 0.1476, indicating that the null hypothesis is plausible; we conclude that this logistic model is also adequate.
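The two models can also be compared head-to-head with a likelihood-ratio (deviance) test: the deviance difference 72.562 - 69.509 = 3.053 on 61 - 59 = 2 degrees of freedom tests whether dropping Race and Sex hurts the fit. A sketch:

# LRT: reduced model vs full model
anova(model, model1, test = "Chisq")
# equivalently: pchisq(72.562 - 69.509, df = 2, lower.tail = FALSE) is about 0.22,
# so dropping Race and Sex does not significantly worsen the fit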
Here we use the significant model for the calculations:
y = 0.142 + (1.466)Dust + (0.678)Smoke - (0.333)Employ
Only x1 (Dust), x4 (Smoke), and x5 (Employ) enter the calculation, since x2 (Race) and x3 (Sex) are not significant. Taking y as the fitted logit, the probability is recovered by
p = exp(y) / (1 + exp(y))

y = 0.142 + (1.466)(2) + (0.678)(2) - (0.333)(3) = 3.431
p = exp(3.431) / (1 + exp(3.431)) = 0.968

y = 0.142 + (1.466)(1) + (0.678)(1) - (0.333)(3) = 1.287
p = exp(1.287) / (1 + exp(1.287)) = 0.783

y = 0.142 + (1.466)(2) + (0.678)(2) - (0.333)(2) = 3.764
p = exp(3.764) / (1 + exp(3.764)) = 0.977

y = 0.142 + (1.466)(3) + (0.678)(2) - (0.333)(1) = 5.563
p = exp(5.563) / (1 + exp(5.563)) = 0.996
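All four probabilities can be reproduced with predict(), avoiding the hand arithmetic; a sketch assuming the same predictor codings as above (Dust = x1, Smoke = x4, Employ = x5):

# the four (Dust, Smoke, Employ) scenarios worked through above
new_obs <- data.frame(Dust   = c(2, 1, 2, 3),
                      Smoke  = c(2, 1, 2, 2),
                      Employ = c(3, 3, 2, 1))
round(predict(model, newdata = new_obs, type = "response"), 3)
# expected: roughly 0.968, 0.783, 0.977, 0.996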