library(readr)   # provides read_csv()
asphalt <- read_csv("/Users/shubhamchougule/Downloads/asphalt.csv")
## Parsed with column specification:
## cols(
## y = col_double(),
## visc = col_double(),
## `%surf` = col_double(),
## `%base` = col_double(),
## `%fines` = col_double(),
## `%voids` = col_double(),
## run = col_double()
## )
fit<-lm(y~.,data = asphalt)
summary(fit)
##
## Call:
## lm(formula = y ~ ., data = asphalt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6781 -1.8309 0.1751 1.4858 11.1262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.970450 36.118989 -1.743 0.0941 .
## visc 0.003071 0.008161 0.376 0.7100
## `%surf` 7.498028 3.967155 1.890 0.0709 .
## `%base` 6.225817 4.812723 1.294 0.2081
## `%fines` 0.522211 1.174673 0.445 0.6606
## `%voids` -0.241275 1.684963 -0.143 0.8873
## run -5.386297 0.985384 -5.466 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.94 on 24 degrees of freedom
## Multiple R-squared: 0.7274, Adjusted R-squared: 0.6592
## F-statistic: 10.67 on 6 and 24 DF, p-value: 8.588e-06
The summary of the lm fit gives us the estimated coefficients of the fitted model.
The fitted full regression model with all six predictors is
y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + e
y = -62.97 + (0.003)visc + (7.50)surf + (6.22)base + (0.522)fines - (0.241)voids - (5.38)run
The adjusted R-squared is 65.92%, a reasonably good value, and the overall p-value (8.588e-06) is well below 0.05.
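As a quick sanity check, the coefficients in the equation above can be read straight off the fit object; a minimal sketch (round() is only for display):

# extract the estimated coefficients from the fitted model
round(coef(fit), 4)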
mod0 <- lm(y ~ 1,data = asphalt)
anova(fit,mod0)
## Analysis of Variance Table
##
## Model 1: y ~ visc + `%surf` + `%base` + `%fines` + `%voids` + run
## Model 2: y ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 372.51
## 2 30 1366.52 -6 -994 10.674 8.588e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA test comparing the fitted model against the intercept-only model.
The overall p-value of the fit (8.588e-06) is less than 0.05 (alpha), so the full model is a significant improvement over the null model and we can continue with the analysis.
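The F statistic in this table can be reproduced by hand from the two residual sums of squares; a minimal sketch using the values printed above:

# F = ((RSS0 - RSS1)/df_diff) / (RSS1/df_res)
RSS1 <- 372.51; RSS0 <- 1366.52
Fstat <- ((RSS0 - RSS1) / 6) / (RSS1 / 24)
Fstat                                 # ~10.67, matching the table
pf(Fstat, 6, 24, lower.tail = FALSE)  # ~8.6e-06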
#The fitted equation for y when the run indicator is 1 and when it is -1.
#The general fitted line is
y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + e
y = -62.97 + (0.003)visc + (7.50)surf + (6.22)base + (0.522)fines - (0.241)voids - (5.38)run
#run = {1, -1}
#When run = 1, the run term folds into the intercept: y = (b0 + b6) + b1x1 + ... + b5x5 + e
y = -68.35 + (0.003)visc + (7.50)surf + (6.22)base + (0.522)fines - (0.241)voids
#When run = -1: y = (b0 - b6) + b1x1 + ... + b5x5 + e
y = -57.59 + (0.003)visc + (7.50)surf + (6.22)base + (0.522)fines - (0.241)voids
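The two intercepts can also be computed directly from the coefficient vector instead of by hand; a small sketch (fit is still the full model at this point):

b <- coef(fit)
b["(Intercept)"] + b["run"]   # intercept when run = 1  (about -68.36)
b["(Intercept)"] - b["run"]   # intercept when run = -1 (about -57.58)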
library(olsrr)   # provides ols_step_all_possible()
ols_step_all_possible(fit)
## Index N Predictors R-Square Adj. R-Square
## 6 1 1 run 0.65755826 0.6457499286
## 1 2 1 visc 0.31095511 0.2871949378
## 5 3 1 `%voids` 0.17970999 0.1514241240
## 4 4 1 `%fines` 0.07244760 0.0404630335
## 3 5 1 `%base` 0.04418469 0.0112255443
## 2 6 1 `%surf` 0.03260116 -0.0007574194
## 15 7 2 `%surf` run 0.69469939 0.6728922025
## 18 8 2 `%base` run 0.67054201 0.6470093012
## 21 9 2 `%voids` run 0.66837793 0.6446906378
## 20 10 2 `%fines` run 0.66422323 0.6402391724
## 11 11 2 visc run 0.65895777 0.6345976113
## 10 12 2 visc `%voids` 0.34222182 0.2952376594
## 9 13 2 visc `%fines` 0.33205132 0.2843406952
## 7 14 2 visc `%surf` 0.32275960 0.2743852908
## 8 15 2 visc `%base` 0.32205656 0.2736320274
## 19 16 2 `%fines` `%voids` 0.28241390 0.2311577462
## 17 17 2 `%base` `%voids` 0.21922390 0.1634541745
## 14 18 2 `%surf` `%voids` 0.17971528 0.1211235195
## 13 19 2 `%surf` `%fines` 0.12034404 0.0575114670
## 12 20 2 `%surf` `%base` 0.10409255 0.0400991655
## 16 21 2 `%base` `%fines` 0.08304511 0.0175483337
## 34 22 3 `%surf` `%base` run 0.72392146 0.6932460640
## 36 23 3 `%surf` `%fines` run 0.70650950 0.6738994421
## 37 24 3 `%surf` `%voids` run 0.69488278 0.6609808705
## 25 25 3 visc `%surf` run 0.69470235 0.6607803907
## 40 26 3 `%base` `%voids` run 0.68193515 0.6465946133
## 41 27 3 `%fines` `%voids` run 0.68035604 0.6448400402
## 39 28 3 `%base` `%fines` run 0.67189892 0.6354432470
## 28 29 3 visc `%base` run 0.67096103 0.6344011455
## 31 30 3 visc `%voids` run 0.66840244 0.6315582685
## 30 31 3 visc `%fines` run 0.66507010 0.6278556636
## 29 32 3 visc `%fines` `%voids` 0.38197598 0.3133066485
## 27 33 3 visc `%base` `%voids` 0.35641483 0.2849053620
## 23 34 3 visc `%surf` `%fines` 0.35039467 0.2782163012
## 24 35 3 visc `%surf` `%voids` 0.34361049 0.2706783259
## 22 36 3 visc `%surf` `%base` 0.34353968 0.2705996495
## 26 37 3 visc `%base` `%fines` 0.33455532 0.2606170192
## 38 38 3 `%base` `%fines` `%voids` 0.28642038 0.2071337507
## 35 39 3 `%surf` `%fines` `%voids` 0.28320788 0.2035643114
## 33 40 3 `%surf` `%base` `%voids` 0.22299454 0.1366606039
## 32 41 3 `%surf` `%base` `%fines` 0.14516806 0.0501867364
## 53 42 4 `%surf` `%base` `%fines` run 0.72576870 0.6835792648
## 44 43 4 visc `%surf` `%base` run 0.72515239 0.6828681403
## 54 44 4 `%surf` `%base` `%voids` run 0.72397377 0.6815081916
## 55 45 4 `%surf` `%fines` `%voids` run 0.70779239 0.6628373697
## 46 46 4 visc `%surf` `%fines` run 0.70667231 0.6615449762
## 47 47 4 visc `%surf` `%voids` run 0.69488635 0.6479457881
## 56 48 4 `%base` `%fines` `%voids` run 0.68625512 0.6379866818
## 50 49 4 visc `%base` `%voids` run 0.68214999 0.6332499858
## 51 50 4 visc `%fines` `%voids` run 0.68062905 0.6314950598
## 49 51 4 visc `%base` `%fines` run 0.67225422 0.6218317947
## 45 52 4 visc `%surf` `%fines` `%voids` 0.38454367 0.2898580768
## 48 53 4 visc `%base` `%fines` `%voids` 0.38370552 0.2888909813
## 43 54 4 visc `%surf` `%base` `%voids` 0.36211989 0.2639844838
## 42 55 4 visc `%surf` `%base` `%fines` 0.35797706 0.2592043041
## 52 56 4 `%surf` `%base` `%fines` `%voids` 0.28882879 0.1794178396
## 58 57 5 visc `%surf` `%base` `%fines` run 0.72716609 0.6725993100
## 62 58 5 `%surf` `%base` `%fines` `%voids` run 0.72579057 0.6709486885
## 59 59 5 visc `%surf` `%base` `%voids` run 0.72515420 0.6701850413
## 60 60 5 visc `%surf` `%fines` `%voids` run 0.70839139 0.6500696665
## 61 61 5 visc `%base` `%fines` `%voids` run 0.68682460 0.6241895189
## 57 62 5 visc `%surf` `%base` `%fines` `%voids` 0.38801966 0.2656235867
## 63 63 6 visc `%surf` `%base` `%fines` `%voids` run 0.72739899 0.6592487329
## Mallow's Cp
## 6 3.148830
## 1 33.664035
## 5 45.218955
## 4 54.662417
## 3 57.150705
## 2 58.170527
## 15 1.878897
## 18 4.005731
## 21 4.196259
## 20 4.562042
## 11 5.025617
## 10 32.911290
## 9 33.806709
## 7 34.624758
## 8 34.686655
## 19 38.176825
## 17 43.740120
## 14 47.218488
## 13 52.445578
## 12 53.876371
## 16 55.729404
## 34 1.306164
## 36 2.839126
## 37 3.862751
## 25 3.878637
## 40 5.002670
## 41 5.141697
## 39 5.886268
## 28 5.968840
## 31 6.194101
## 30 6.487483
## 29 31.411303
## 27 33.661727
## 23 34.191746
## 24 34.789030
## 22 34.795264
## 26 35.586255
## 38 39.824091
## 35 40.106922
## 33 45.408150
## 32 52.260052
## 53 3.143532
## 44 3.197792
## 54 3.301559
## 55 4.726180
## 46 4.824792
## 47 5.862437
## 56 6.622337
## 50 6.983756
## 51 7.117661
## 49 7.854987
## 45 33.185242
## 48 33.259034
## 43 35.159449
## 42 35.524186
## 52 41.612052
## 58 5.020504
## 62 5.141606
## 59 5.197633
## 60 6.673443
## 61 8.572200
## 57 34.879214
## 63 7.000000
fit<-lm(y~`%surf`+`%base`+run,data = asphalt)
summary(fit)
##
## Call:
## lm(formula = y ~ `%surf` + `%base` + run, data = asphalt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9414 -1.7181 -0.1026 1.4959 11.0532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.6005 28.0368 -2.233 0.0340 *
## `%surf` 7.4270 3.2506 2.285 0.0304 *
## `%base` 6.8282 4.0391 1.691 0.1024
## run -5.2685 0.6767 -7.786 2.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.738 on 27 degrees of freedom
## Multiple R-squared: 0.7239, Adjusted R-squared: 0.6932
## F-statistic: 23.6 on 3 and 27 DF, p-value: 1.044e-07
The function ols_step_all_possible fits every possible subset regression so that we can select the best model. We select on adjusted R-squared, since adjusted R-squared is appropriate for comparing models with different numbers of variables.
From the output, the subset {%surf, %base, run} has the highest adjusted R-squared (69.32%), so this is the best model.
We refit with these three predictors and summarise the new model, as shown above.
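Rather than scanning the table by eye, the winning subset can be picked programmatically. A sketch, assuming the olsrr result behaves like a data frame with predictors/adjr/cp columns, as in the olsrr version used here (newer versions keep the table under $result):

# refit the full model inline, since fit now holds the reduced model
all_fits <- ols_step_all_possible(lm(y ~ ., data = asphalt))
best <- all_fits[which.max(all_fits$adjr), ]
best[c("predictors", "adjr", "cp")]   # expect `%surf` `%base` run, adjr ~ 0.693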
#Checking the model assumptions with diagnostic plots
plot(fit)
Residuals vs Fitted: the line is not flat, so the model does not capture a purely linear relationship; the linearity assumption appears violated.
Normal Q-Q plot: large deviations from the reference line can be seen, so the residuals do not look normally distributed; the normality assumption is violated.
Scale-Location: the line is not flat and horizontal, so there is no sign of homoscedasticity; the constant-variance assumption is violated.
Residuals vs Leverage: no values fall in the upper or lower right-hand corners of the plot beyond the red (Cook's distance) bands, so there is no evidence of influential cases.
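All four panels can be shown on one page, which makes them easier to compare; a small convenience sketch:

par(mfrow = c(2, 2))   # 2x2 grid of diagnostic panels
plot(fit)
par(mfrow = c(1, 1))   # reset the plotting layout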
#Testing for outliers
library(car)   # provides outlierTest(), ncvTest(), durbinWatsonTest()
outlierTest(fit)
## rstudent unadjusted p-value Bonferroni p
## 13 4.351371 0.00018611 0.0057695
Hypotheses: H0 = the observation is NOT an outlier; HA = the observation is an outlier.
The outlierTest function runs a Bonferroni-adjusted outlier test on the most extreme studentized residual. Observation 13 is the most extreme observation in our data, and its Bonferroni p-value (0.0058) is below 0.05, so the null is rejected: observation 13 is an outlier.
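A common follow-up (not part of the original analysis) is to refit without the flagged observation and see how much the coefficients move; a hypothetical sketch:

# refit excluding observation 13 and compare the coefficient estimates
fit_no13 <- update(fit, subset = -13)
cbind(full = coef(fit), without_13 = coef(fit_no13))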
#Testing for Normality
# distribution of studentized residuals
library(MASS)   # provides studres()
sresid <- studres(fit)
hist(sresid, freq = FALSE,
     main = "Distribution of Studentized Residuals")
# overlay a standard normal density for comparison
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
The histogram does not match the overlaid normal curve: the bulk of the residuals sits to the left with a long right tail. Normality does not look good, so the normality assumption appears violated.
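A Q-Q plot of the studentized residuals tells the same story with confidence bands; a sketch using car's qqPlot (car was loaded above):

qqPlot(fit, main = "QQ Plot of Studentized Residuals")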
shapiro.test(fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit$residuals
## W = 0.93002, p-value = 0.04391
Hypotheses: H0 = errors are normally distributed; HA = errors are not normally distributed.
Here the p-value (0.044) is less than 0.05, so we reject H0 in favour of HA: the errors are not normally distributed.
#Testing the Homoscedasticity Assumption
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 13.25413, Df = 1, p = 0.00027198
p < 0.05, suggesting that the errors are not homoscedastic. The constant error variance assumption is violated.
#Testing the Independence Assumption
durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.1063792 2.159149 0.722
## Alternative hypothesis: rho != 0
Hypotheses: H0 = errors are uncorrelated; HA = errors are correlated.
p > 0.05, so we fail to reject H0: the errors are uncorrelated and the independence assumption holds.
byssinosis <- read_csv("/Users/shubhamchougule/Downloads/byssinosis.csv")
## Parsed with column specification:
## cols(
## BysYes = col_double(),
## BysNo = col_double(),
## Total = col_double(),
## Dust = col_double(),
## Race = col_double(),
## Sex = col_double(),
## Smoke = col_double(),
## Employ = col_double()
## )
model1<-glm(cbind(BysNo,Total-BysNo)~Dust + Race + Sex + Smoke + Employ,family = "binomial",data=byssinosis)
summary(model1)
##
## Call:
## glm(formula = cbind(BysNo, Total - BysNo) ~ Dust + Race + Sex +
## Smoke + Employ, family = "binomial", data = byssinosis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9804 -0.3688 0.2421 0.7573 3.4126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4852 0.6060 0.801 0.42331
## Dust 1.3751 0.1155 11.901 < 2e-16 ***
## Race -0.2463 0.2061 -1.195 0.23203
## Sex 0.2590 0.2116 1.224 0.22095
## Smoke 0.6292 0.1931 3.259 0.00112 **
## Employ -0.3856 0.1069 -3.607 0.00031 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 322.527 on 64 degrees of freedom
## Residual deviance: 69.509 on 59 degrees of freedom
## AIC: 188.19
##
## Number of Fisher Scoring iterations: 5
To fit the logistic regression we use the glm function with family = "binomial". From the summary, the residual deviance is 69.509 on 59 degrees of freedom.
The fitted logit equation is
y = 0.4852 + (1.3751)Dust - (0.2463)Race + (0.2590)Sex + (0.6292)Smoke - (0.3856)Employ
Among the predictors, Dust, Smoke, and Employ have a significant effect on byssinosis: their p-values are below 0.05, which makes these three predictors significant; Race and Sex are not significant.
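Since these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret; a minimal sketch:

# odds ratios (and profile-likelihood 95% CIs) for the full logistic model
exp(coef(model1))
exp(confint(model1))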
Building a model with only the significant predictors:
model<-glm(cbind(BysNo,Total-BysNo)~Dust + Smoke + Employ,family = "binomial",data=byssinosis)
summary(model)
##
## Call:
## glm(formula = cbind(BysNo, Total - BysNo) ~ Dust + Smoke + Employ,
## family = "binomial", data = byssinosis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0523 -0.4001 0.2518 0.7700 3.3421
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.14177 0.34120 0.415 0.677783
## Dust 1.46572 0.10578 13.856 < 2e-16 ***
## Smoke 0.67781 0.18871 3.592 0.000328 ***
## Employ -0.33313 0.08861 -3.760 0.000170 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 322.527 on 64 degrees of freedom
## Residual deviance: 72.562 on 61 degrees of freedom
## AIC: 187.24
##
## Number of Fisher Scoring iterations: 5
The fitted equation with only the significant predictors is
y = 0.142 + (1.466)Dust + (0.678)Smoke - (0.333)Employ
#Check the adequacy of main model
deviance(model1)
## [1] 69.50926
# Gives the p value for the goodness of fit test
pchisq(model1$deviance, df=model1$df.residual, lower.tail=FALSE)
## [1] 0.1645594
Hypotheses: H0 = the model fits the data well; HA = the model does not fit the data well.
The chi-squared test statistic of 69.509 on 59 degrees of freedom gives a p-value of 0.1646, indicating that the null hypothesis is plausible; we conclude that the logistic model is adequate.
#Check the adequacy of significant model
deviance(model)
## [1] 72.56192
# Gives the p value for the goodness of fit test
pchisq(model$deviance, df=model$df.residual, lower.tail=FALSE)
## [1] 0.1476404
Hypotheses: H0 = the model fits the data well; HA = the model does not fit the data well.
The chi-squared test statistic of 72.562 on 61 degrees of freedom gives a p-value of 0.1476, indicating that the null hypothesis is plausible; we conclude that this logistic model is also adequate.
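The two models can also be compared head-to-head with a likelihood-ratio (deviance) test: the deviance difference 72.562 - 69.509 = 3.053 on 61 - 59 = 2 degrees of freedom tests whether dropping Race and Sex hurts the fit. A sketch:

# LRT: reduced model vs full model
anova(model, model1, test = "Chisq")
# equivalently: pchisq(72.562 - 69.509, df = 2, lower.tail = FALSE) is about 0.22,
# so dropping Race and Sex does not significantly worsen the fit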
Here we use the significant model for the calculations:
y = 0.142 + (1.466)Dust + (0.678)Smoke - (0.333)Employ
Only x1 (Dust), x4 (Smoke), and x5 (Employ) enter the calculation, since x2 (Race) and x3 (Sex) are not significant. Taking y as the fitted logit, the probability is recovered by
p = exp(y) / (1 + exp(y))

y = 0.142 + (1.466)(2) + (0.678)(2) - (0.333)(3) = 3.431
p = exp(3.431) / (1 + exp(3.431)) = 0.968

y = 0.142 + (1.466)(1) + (0.678)(1) - (0.333)(3) = 1.287
p = exp(1.287) / (1 + exp(1.287)) = 0.783

y = 0.142 + (1.466)(2) + (0.678)(2) - (0.333)(2) = 3.764
p = exp(3.764) / (1 + exp(3.764)) = 0.977

y = 0.142 + (1.466)(3) + (0.678)(2) - (0.333)(1) = 5.563
p = exp(5.563) / (1 + exp(5.563)) = 0.996
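All four probabilities can be reproduced with predict(), avoiding the hand arithmetic; a sketch assuming the same predictor codings as above (Dust = x1, Smoke = x4, Employ = x5):

# the four (Dust, Smoke, Employ) scenarios worked through above
new_obs <- data.frame(Dust   = c(2, 1, 2, 3),
                      Smoke  = c(2, 1, 2, 2),
                      Employ = c(3, 3, 2, 1))
round(predict(model, newdata = new_obs, type = "response"), 3)
# expected: roughly 0.968, 0.783, 0.977, 0.996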