Assignment 2: Regression Analysis

Question 1

Importing the dataset

library(readr)   # provides read_csv()
liver <- read_csv("~/Downloads/liver.CSV")
## Parsed with column specification:
## cols(
##   BdyWt = col_double(),
##   LvrWt = col_double(),
##   Dose = col_double(),
##   Y = col_double()
## )

Question 1 (a)

library(matlib)   # assumed source of inv(); base R's solve() works too

# Converting the dataset to matrix form to do the calculations
mat <- data.matrix(liver, rownames.force = NA)

# Transpose of the matrix using the t() function
tmat <- t(mat)

# Calculation of (X'X)^-1
first <- inv(tmat %*% mat)
first
##                                                     
## [1,]  0.01303019 -0.00666706  -2.6008804   0.1798586
## [2,] -0.00666706  0.05181011   0.9235123  -0.1702090
## [3,] -2.60088043  0.92351235 523.7417028 -38.0768924
## [4,]  0.17985862 -0.17020896 -38.0768924   9.9239625
# The output is a 4x4 matrix

(X'X)^-1 has been calculated. Note, however, that data.matrix(liver) includes the response column Y and has no intercept column, so the matrix above is the cross-product of all four data columns rather than of the conventional regression design matrix.
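For comparison, a minimal sketch of (X'X)^-1 for the conventional design matrix (a column of ones plus the three predictors), using base R only:

# model.matrix() prepends the intercept column of ones
X <- model.matrix(~ BdyWt + LvrWt + Dose, data = liver)
solve(t(X) %*% X)   # (X'X)^-1 for the regression design matrix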

Question 1 (b)

# Fitting the multiple linear regression model
fit <- lm(Y ~ BdyWt + LvrWt + Dose, data = liver)
# Model summary
summary(fit)
## 
## Call:
## lm(formula = Y ~ BdyWt + LvrWt + Dose, data = liver)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.100557 -0.063233  0.007131  0.045971  0.134691 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.265922   0.194585   1.367   0.1919  
## BdyWt       -0.021246   0.007974  -2.664   0.0177 *
## LvrWt        0.014298   0.017217   0.830   0.4193  
## Dose         4.178111   1.522625   2.744   0.0151 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07729 on 15 degrees of freedom
## Multiple R-squared:  0.3639, Adjusted R-squared:  0.2367 
## F-statistic:  2.86 on 3 and 15 DF,  p-value: 0.07197
# The fitted model is
# y = 0.265922 - 0.021246(BdyWt) + 0.014298(LvrWt) + 4.178111(Dose)

The summary gives an adjusted R-squared of about 23%, which is quite low. The p-values obtained for BdyWt and Dose are below 0.05, but the p-value for LvrWt is not significant. In the further analysis we will look at which subset of predictors gives the best fit.

Fitted model

y = 0.265922 - 0.021246(BdyWt) + 0.014298(LvrWt) + 4.178111(Dose)
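These coefficients can also be recovered from the matrix form of the least-squares estimator, beta-hat = (X'X)^-1 X'Y; a minimal sketch (output omitted):

# Least-squares estimate in matrix form; should match coef(fit)
X <- model.matrix(~ BdyWt + LvrWt + Dose, data = liver)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% liver$Y
beta_hat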

Question 1 (c)

# ANOVA test comparing the full model against the intercept-only (null) model
mod0 <- lm(Y ~ 1, data = liver)
anova(fit, mod0)
## Analysis of Variance Table
## 
## Model 1: Y ~ BdyWt + LvrWt + Dose
## Model 2: Y ~ 1
##   Res.Df      RSS Df Sum of Sq      F  Pr(>F)  
## 1     15 0.089609                              
## 2     18 0.140874 -3 -0.051265 2.8605 0.07197 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test gives an overall p-value of 0.07197 for the model, which is greater than the alpha value of 0.05. This suggests the model as a whole is not significant, and we cannot reject the null hypothesis that all slope coefficients are zero.
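The F statistic can be reproduced by hand from the two residual sums of squares in the table; a minimal sketch:

rss_full <- 0.089609   # RSS of the full model, 15 df
rss_null <- 0.140874   # RSS of the intercept-only model, 18 df
F_stat <- ((rss_null - rss_full) / 3) / (rss_full / 15)
F_stat                                  # approximately 2.86
pf(F_stat, 3, 15, lower.tail = FALSE)   # approximately 0.072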

# t-tests for the individual coefficients

# Two-sided critical value at alpha = 0.05 with 15 degrees of freedom
abs(qt(0.05/2, 15))
## [1] 2.13145

From the summary in part (b), we obtain the individual t-test values.

t-test for BdyWt

1. Specify the hypotheses: H0: B1 = 0 vs H1: B1 != 0, at alpha = 0.05.

2. Determine the t-value for BdyWt: t_value(BdyWt) = -2.664.

3. Check whether the t-value lies within the critical values. The critical values for 15 degrees of freedom are (-2.13145, 2.13145).

Here the t-value does not lie between the critical values, hence we reject the null hypothesis. The p-value (0.0177) is also < 0.05, so for BdyWt we reject the null hypothesis and conclude that B1 is significant.

t-test for LvrWt

1. Specify the hypotheses: H0: B2 = 0 vs H1: B2 != 0, at alpha = 0.05.

2. Determine the t-value for LvrWt: t_value(LvrWt) = 0.830.

3. Check whether the t-value lies within the critical values. The critical values for 15 degrees of freedom are (-2.13145, 2.13145).

Here the t-value does lie between the critical values, hence we cannot reject the null hypothesis. The p-value (0.4193) is also > 0.05, so for LvrWt we do not reject the null hypothesis and conclude that B2 is not significant.

t-test for Dose

1. Specify the hypotheses: H0: B3 = 0 vs H1: B3 != 0, at alpha = 0.05.

2. Determine the t-value for Dose: t_value(Dose) = 2.744.

3. Check whether the t-value lies within the critical values. The critical values for 15 degrees of freedom are (-2.13145, 2.13145).

Here the t-value does not lie between the critical values, hence we reject the null hypothesis. The p-value (0.0151) is also < 0.05, so for Dose we reject the null hypothesis and conclude that B3 is significant.

Hence, from the above t-tests we conclude that the LvrWt predictor can be left out of the model, and we can carry on with the BdyWt and Dose predictors on the basis of significance.
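Each t statistic is simply the estimate divided by its standard error, with a two-sided p-value on 15 degrees of freedom; a minimal sketch for BdyWt using the values from the summary:

t_bdywt <- -0.021246 / 0.007974   # approximately -2.664
2 * pt(-abs(t_bdywt), df = 15)    # approximately 0.0177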

Question 1 (d)

library(corrplot)   # for the correlation plot below

# Multicollinearity between the predictors can be seen in the correlation matrix
mydata <- liver[, 1:3]
cor(mydata)
##           BdyWt     LvrWt      Dose
## BdyWt 1.0000000 0.5000101 0.9902126
## LvrWt 0.5000101 1.0000000 0.4900711
## Dose  0.9902126 0.4900711 1.0000000
corrplot(cor(mydata))

From the correlation matrix we see a very high correlation (0.99) between BdyWt and Dose, which can also be seen in the plot. Hence the BdyWt and Dose predictors are strongly correlated, i.e. the model suffers from multicollinearity.
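As a complementary check (not run above), the vif() function from the car package can quantify this collinearity; given the 0.99 correlation, very large VIF values for BdyWt and Dose would be expected. A minimal sketch, assuming car is installed:

library(car)   # assumed available; provides vif()
vif(fit)       # large values for BdyWt and Dose would confirm the collinearity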

Question 1 (e)

# Backward elimination by AIC (Fstay is not an argument of step(), so it is dropped)
step(fit, direction = "backward")
## Start:  AIC=-93.78
## Y ~ BdyWt + LvrWt + Dose
## 
##         Df Sum of Sq      RSS     AIC
## - LvrWt  1  0.004120 0.093729 -94.924
## <none>               0.089609 -93.778
## - BdyWt  1  0.042408 0.132017 -88.416
## - Dose   1  0.044982 0.134591 -88.049
## 
## Step:  AIC=-94.92
## Y ~ BdyWt + Dose
## 
##         Df Sum of Sq      RSS     AIC
## <none>               0.093729 -94.924
## - BdyWt  1  0.039851 0.133580 -90.192
## - Dose   1  0.043929 0.137658 -89.621
## 
## Call:
## lm(formula = Y ~ BdyWt + Dose, data = liver)
## 
## Coefficients:
## (Intercept)        BdyWt         Dose  
##     0.28552     -0.02044      4.12533

In stepwise backward elimination we start with all the predictors and eliminate the least contributive predictor at each step until no removal improves the AIC. Here the LvrWt predictor has been eliminated, and the best model obtained is Y ~ BdyWt + Dose.
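To refit the selected model directly, a minimal sketch (output omitted; fit2 is an illustrative name):

fit2 <- lm(Y ~ BdyWt + Dose, data = liver)
summary(fit2)   # coefficients should match the step() output above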

Question 2

library(readxl)   # provides read_xlsx()
crest <- read_xlsx("~/Desktop/RMIT/Sem 3/Regression Analysis/Assignment 2/toothpaste.xlsx")

Fitting the model and Significance

crestfit <- lm(y ~ x1 + x2 + x3, data = crest)
summary(crestfit)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = crest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25305.8  -1092.4   -273.6   3212.0  17611.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.827e+04  2.195e+04   1.744    0.112    
## x1          -5.997e-02  5.071e-01  -0.118    0.908    
## x2          -1.641e+03  2.011e+04  -0.082    0.937    
## x3           1.164e+02  1.035e+01  11.248 5.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11160 on 10 degrees of freedom
## Multiple R-squared:  0.958,  Adjusted R-squared:  0.9455 
## F-statistic: 76.11 on 3 and 10 DF,  p-value: 3.458e-07

Looking at the summary, only the x3 predictor is significant, with a p-value below alpha = 0.05; x1 and x2 are not significant since their p-values are > 0.05. However, the model has a high adjusted R-squared of about 94%, so the response is explained quite well, and the overall p-value of 3.458e-07 is highly significant. The fitted model is

y = 38271.18 - 0.060 x1 - 1640.79 x2 + 116.41 x3

From this result we can say that the x3 variable is the important one.

par(mfrow=c(2,2))
plot(crestfit, which = 1:4)

First we check the assumptions using the diagnostic plots. In the Residuals vs Fitted plot, the points scatter around the zero line, so linearity looks reasonable. In the Normal Q-Q plot most points lie on the line, so normality seems acceptable. The Scale-Location plot shows no systematic pattern, so it also looks fine. The Cook's distance plot shows a single observation with potentially high influence.
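To identify that observation numerically, Cook's distances can be extracted from the fitted model; a minimal sketch (the specific row index is not shown above, so it is left unnamed):

cd <- cooks.distance(crestfit)   # one Cook's distance per observation
which.max(cd)                    # row index of the most influential point
max(cd)                          # its Cook's distance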

Residual Analysis

library(car)   # provides ncvTest(), durbinWatsonTest(), and vif() used below

# Evaluate homoscedasticity: non-constant error variance test
ncvTest(crestfit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.1343758, Df = 1, p = 0.71394

Here the p-value is high, so we do not have evidence to reject the null hypothesis of constant variance. This suggests that the constant error variance assumption is not violated.
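As an alternative test of the same homoscedasticity hypothesis, the Breusch-Pagan test from the lmtest package could be used; a sketch, assuming lmtest is installed (note it regresses squared residuals on the predictors rather than on the fitted values):

library(lmtest)    # assumed available
bptest(crestfit)   # Breusch-Pagan test for non-constant variance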

# Test for Autocorrelated Errors

acf(crestfit$residuals)

durbinWatsonTest(crestfit)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.0248056      1.920721   0.426
##  Alternative hypothesis: rho != 0

In the ACF plot all the points lie inside the blue confidence bounds, which means none of the lags is significant. There is no autocorrelation in the residuals, so the assumption of independent errors holds.

The Durbin-Watson test's p-value (0.426) likewise indicates that the independence assumption is satisfied.

shapiro.test(crestfit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  crestfit$residuals
## W = 0.87971, p-value = 0.05756

Here the p-value (0.0576) is marginally above 0.05, so at the 5% level we do not reject the null hypothesis; the normality assumption is not violated, although the result is borderline.

Multicollinearity

# Correlation matrix of the predictors
mydata2 <- crest[, 3:5]
cor(mydata2)
##           x1        x2        x3
## x1 1.0000000 0.1269699 0.3498036
## x2 0.1269699 1.0000000 0.6154342
## x3 0.3498036 0.6154342 1.0000000
corrplot(cor(mydata2))

In the multicollinearity check via the correlation matrix, the largest correlation is 0.6154, between x2 and x3, so these two predictors may be moderately correlated.

vif(crestfit)
##       x1       x2       x3 
## 1.155957 1.633041 1.830726

The largest VIF value is 1.83, which is well below the common cutoff of 10 and therefore within tolerance. This suggests there is no serious multicollinearity among the predictors.
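Each VIF equals 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors; a minimal sketch for x3:

# R-squared of x3 regressed on the other predictors
r2_x3 <- summary(lm(x3 ~ x1 + x2, data = crest))$r.squared
1 / (1 - r2_x3)   # should reproduce vif(crestfit)["x3"], about 1.83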

Regression procedure

# Full model should contain all the variables
stepwise <- crest[, 2:5]
full <- lm(y ~ ., data = stepwise)

# Null model contains no predictors (intercept only)
null <- lm(y ~ 1, data = stepwise)
step(full, direction = "backward")
## Start:  AIC=264.24
## y ~ x1 + x2 + x3
## 
##        Df  Sum of Sq        RSS    AIC
## - x2    1 8.2878e+05 1.2452e+09 262.25
## - x1    1 1.7408e+06 1.2461e+09 262.26
## <none>               1.2443e+09 264.24
## - x3    1 1.5743e+10 1.6987e+10 298.83
## 
## Step:  AIC=262.25
## y ~ x1 + x3
## 
##        Df  Sum of Sq        RSS    AIC
## - x1    1 1.4866e+06 1.2467e+09 260.27
## <none>               1.2452e+09 262.25
## - x3    1 2.5070e+10 2.6315e+10 302.96
## 
## Step:  AIC=260.27
## y ~ x3
## 
##        Df  Sum of Sq        RSS    AIC
## <none>               1.2467e+09 260.27
## - x3    1 2.8411e+10 2.9658e+10 302.63
## 
## Call:
## lm(formula = y ~ x3, data = stepwise)
## 
## Coefficients:
## (Intercept)           x3  
##     36068.4        115.6

In the first stage the AIC is 264.24; removing x2 gives the lowest AIC, 262.25, so x2 is removed first in the backward procedure. In the second stage the AIC is 262.25; removing x1 reduces it to 260.27, so x1 is removed. In the third stage the AIC is 260.27, and removing x3 would increase it to 302.63, so we stop here and keep x3, giving the best-fitting model y ~ x3.
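The AIC values that step() reports come from extractAIC(), which for a linear model is n*log(RSS/n) + 2k, with k the number of estimated coefficients; a sketch reproducing the final value (here n = 14, from the 10 residual degrees of freedom plus 4 coefficients in the full model):

n <- nrow(stepwise)        # 14 observations
rss <- 1.2467e9            # RSS of y ~ x3 from the table above
n * log(rss / n) + 2 * 2   # approximately 260.27, matching the output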

# Forward selection using AIC values
step(null, scope = list(lower = null, upper = full), direction = "forward")
## Start:  AIC=302.64
## y ~ 1
## 
##        Df  Sum of Sq        RSS    AIC
## + x3    1 2.8411e+10 1.2467e+09 260.27
## + x2    1 1.0637e+10 1.9020e+10 298.42
## <none>               2.9658e+10 302.63
## + x1    1 3.3431e+09 2.6315e+10 302.96
## 
## Step:  AIC=260.27
## y ~ x3
## 
##        Df Sum of Sq        RSS    AIC
## <none>              1246663758 260.27
## + x1    1   1486631 1245177127 262.25
## + x2    1    574590 1246089168 262.26
## 
## Call:
## lm(formula = y ~ x3, data = stepwise)
## 
## Coefficients:
## (Intercept)           x3  
##     36068.4        115.6

The null model has an AIC of 302.64. In the first stage we add the predictor with the lowest AIC, which is x3 at 260.27. In the next stage, adding either x1 (AIC 262.25) or x2 (AIC 262.26) would increase the AIC above 260.27, so we stop at the second stage and select no further predictors. Hence y ~ x3 is the best fit.

# Stepwise selection (both directions) by AIC
step(crestfit, direction = "both")
## Start:  AIC=264.24
## y ~ x1 + x2 + x3
## 
##        Df  Sum of Sq        RSS    AIC
## - x2    1 8.2878e+05 1.2452e+09 262.25
## - x1    1 1.7408e+06 1.2461e+09 262.26
## <none>               1.2443e+09 264.24
## - x3    1 1.5743e+10 1.6987e+10 298.83
## 
## Step:  AIC=262.25
## y ~ x1 + x3
## 
##        Df  Sum of Sq        RSS    AIC
## - x1    1 1.4866e+06 1.2467e+09 260.27
## <none>               1.2452e+09 262.25
## + x2    1 8.2878e+05 1.2443e+09 264.24
## - x3    1 2.5070e+10 2.6315e+10 302.96
## 
## Step:  AIC=260.27
## y ~ x3
## 
##        Df  Sum of Sq        RSS    AIC
## <none>               1.2467e+09 260.27
## + x1    1 1.4866e+06 1.2452e+09 262.25
## + x2    1 5.7459e+05 1.2461e+09 262.26
## - x3    1 2.8411e+10 2.9658e+10 302.63
## 
## Call:
## lm(formula = y ~ x3, data = crest)
## 
## Coefficients:
## (Intercept)           x3  
##     36068.4        115.6

In the first stage the AIC is 264.24; removing x2 gives the lowest AIC, 262.25, so x2 is removed first in the stepwise procedure. In the second stage the AIC is 262.25; removing x1 reduces it to 260.27, so x1 is removed. In the third stage the AIC is 260.27, and removing x3 would increase it to 302.63, so we stop here and keep x3, which gives the best fit.

The stepwise and backward regression procedures perform the same number of steps, whereas the forward procedure takes only two steps.

Conclusion

From all the above observations we conclude that x3 is the best predictor, and y ~ x3 is the best-fitting model.
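To state the final model explicitly, it can be refit on its own; a minimal sketch (output omitted; 'best' is an illustrative name):

# Final selected model: y = 36068.4 + 115.6 * x3
best <- lm(y ~ x3, data = crest)
summary(best)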