An experiment was conducted to investigate the amount of drug retained in the liver of a rat. Nineteen rats were chosen for the experiment and each rat was administered a dose of the drug. After a fixed length of time, the rat was sacrificed, the liver weighed and the percentage of the dose in the liver determined. The questions can be attempted using whatever is most convenient, a computer package or hand calculations. When computer software is used you need to include the output and code you used (or details on the steps you followed) to generate the output. The variables are: X1 = body weight of rat X2 = weight of liver X3 = dose given Y = percentage of dose in liver.
library(readr)
library(dplyr)
library(tidyverse)
library(car)
We have imported the necessary packages into R which will come handy in our further assignment.
#import the data
liver <- read.csv("liver.csv")
liver
Here we have imported the data into R and have a glimpse at our “Liver” dataset.
#convert into matrix
livermatrix <- data.matrix(liver)
livermatrix
BdyWt LvrWt Dose Y
[1,] 176 6.5 0.88 0.42
[2,] 176 9.5 0.88 0.25
[3,] 190 9.0 1.00 0.56
[4,] 176 8.9 0.88 0.23
[5,] 200 7.2 1.00 0.23
[6,] 167 8.9 0.83 0.32
[7,] 188 8.0 0.94 0.37
[8,] 195 10.0 0.98 0.41
[9,] 176 8.0 0.88 0.33
[10,] 165 7.9 0.84 0.38
[11,] 158 6.9 0.80 0.27
[12,] 148 7.3 0.74 0.36
[13,] 149 5.2 0.75 0.21
[14,] 163 8.4 0.81 0.28
[15,] 170 7.2 0.85 0.34
[16,] 186 6.8 0.94 0.28
[17,] 146 7.3 0.73 0.30
[18,] 181 9.0 0.90 0.37
[19,] 149 6.4 0.75 0.46
#check class
class(livermatrix)
[1] "matrix" "array"
We have converted the Liver data frame into matrix and confirmed it by checking the class of the matrix.
#convert x and y matrix
x_matrix <- liver %>% select(1,2,3)
x <- data.matrix(x_matrix)
x <- cbind(c(1),x)
x
BdyWt LvrWt Dose
[1,] 1 176 6.5 0.88
[2,] 1 176 9.5 0.88
[3,] 1 190 9.0 1.00
[4,] 1 176 8.9 0.88
[5,] 1 200 7.2 1.00
[6,] 1 167 8.9 0.83
[7,] 1 188 8.0 0.94
[8,] 1 195 10.0 0.98
[9,] 1 176 8.0 0.88
[10,] 1 165 7.9 0.84
[11,] 1 158 6.9 0.80
[12,] 1 148 7.3 0.74
[13,] 1 149 5.2 0.75
[14,] 1 163 8.4 0.81
[15,] 1 170 7.2 0.85
[16,] 1 186 6.8 0.94
[17,] 1 146 7.3 0.73
[18,] 1 181 9.0 0.90
[19,] 1 149 6.4 0.75
y_matrix <- liver %>% select(4)
y <- data.matrix(y_matrix)
y
Y
[1,] 0.42
[2,] 0.25
[3,] 0.56
[4,] 0.23
[5,] 0.23
[6,] 0.32
[7,] 0.37
[8,] 0.41
[9,] 0.33
[10,] 0.38
[11,] 0.27
[12,] 0.36
[13,] 0.21
[14,] 0.28
[15,] 0.34
[16,] 0.28
[17,] 0.30
[18,] 0.37
[19,] 0.46
In above step, we have created two different matrix and named as “X” and “Y” matrix. For solving Question 1a, we need to add a dummy variable in matrix “X” with values “1”. Now we will solve Question 1a.
#Q1a
# X'
Tx <- t(x)
Tx
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
1.00 1.00 1 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.00 1.00
BdyWt 176.00 176.00 190 176.00 200.0 167.00 188.00 195.00 176.00 165.00 158.0 148.00 149.00 163.00
LvrWt 6.50 9.50 9 8.90 7.2 8.90 8.00 10.00 8.00 7.90 6.9 7.30 5.20 8.40
Dose 0.88 0.88 1 0.88 1.0 0.83 0.94 0.98 0.88 0.84 0.8 0.74 0.75 0.81
[,15] [,16] [,17] [,18] [,19]
1.00 1.00 1.00 1.0 1.00
BdyWt 170.00 186.00 146.00 181.0 149.00
LvrWt 7.20 6.80 7.30 9.0 6.40
Dose 0.85 0.94 0.73 0.9 0.75
#x'x
xx <- Tx%*%x
xx
BdyWt LvrWt Dose
19.00 3259.00 148.400 16.3800
BdyWt 3259.00 563899.00 25636.000 2834.8200
LvrWt 148.40 25636.00 1186.000 128.8620
Dose 16.38 2834.82 128.862 14.2538
#(x'x)-1....inverse
inverse_x <- solve(xx)
inverse_x
BdyWt LvrWt Dose
6.33809378 -0.074426957 -0.068005387 8.133435
BdyWt -0.07442696 0.010644476 -0.002783671 -2.006297
LvrWt -0.06800539 -0.002783671 0.049620475 0.183175
Dose 8.13343533 -2.006296702 0.183175008 388.083181
In the Above step, we have done transpose of matrix X and named as “Tx”. After the transpose, we have multiplied that with X matrix and named that matrix as XX. At the end to find the inverse, “Solve” function was used and named as inverse_x.
q2 <- lm(Y~BdyWt+LvrWt+Dose, liver)
summary(q2)
Call:
lm(formula = Y ~ BdyWt + LvrWt + Dose, data = liver)
Residuals:
Min 1Q Median 3Q Max
-0.100557 -0.063233 0.007131 0.045971 0.134691
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.265922 0.194585 1.367 0.1919
BdyWt -0.021246 0.007974 -2.664 0.0177 *
LvrWt 0.014298 0.017217 0.830 0.4193
Dose 4.178111 1.522625 2.744 0.0151 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07729 on 15 degrees of freedom
Multiple R-squared: 0.3639, Adjusted R-squared: 0.2367
F-statistic: 2.86 on 3 and 15 DF, p-value: 0.07197
In the above step, we have calculated the equation of best fit by using lm function. We have used all the 3 variables of X as resposnse to Y variable. Through above step, we found the equation of best fit as:
Y^ = 0.26 - 0.21X1 + 0.01X2 + 4.17X3.
We can also see that the P-value is smaller then 0.05, therefore we can say that the model is significant and we can reject H0. Multiple R square is 0.36 and Adjusted R- square is observed as 0.23.
# Anova
anova(q2)
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
BdyWt 1 0.003216 0.003216 0.5383 0.47446
LvrWt 1 0.003067 0.003067 0.5134 0.48467
Dose 1 0.044982 0.044982 7.5296 0.01507 *
Residuals 15 0.089609 0.005974
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the above we can the values of Sum square of residuals (0.089) and the X1 (0.003), X2(0.003), X3(0.44) and the mean squares as well of both the values. ANOVA table also highlights the F value and the PR value as respective to F. In this case the F values are X1(o.538), X2(0.513) and X3(7.529).
Now, we will perform individual t test on the Co-efficients.
#x1 ttest
t.test(liver$BdyWt)
One Sample t-test
data: liver$BdyWt
t = 45.34, df = 18, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
163.5782 179.4744
sample estimates:
mean of x
171.5263
In the result above :
t is the t-test statistic value (t = 45.34), df is the degrees of freedom (df= 18), p-value is the significance level of the t-test (p-value <0.05). conf.int is the confidence interval of the mean at 95% (conf.int = [163.57, 179.47]); sample estimates is he mean value of the sample (mean = 171.5263).
The p-value of the test is smaller then 0.05.
#x2 ttest
t.test(liver$LvrWt)
One Sample t-test
data: liver$LvrWt
t = 27.84, df = 18, p-value = 2.995e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
7.221116 8.399936
sample estimates:
mean of x
7.810526
In the result above :
t is the t-test statistic value (t = 27.84), df is the degrees of freedom (df= 18), p-value is the significance level of the t-test (p-value <0.05). conf.int is the confidence interval of the mean at 95% (conf.int = [7.221, 8.399]); sample estimates is he mean value of the sample (mean = 7.810).
The p-value of the test is smaller then 0.05.
#x3 ttest
t.test(liver$Dose)
One Sample t-test
data: liver$Dose
t = 43.797, df = 18, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.8207500 0.9034605
sample estimates:
mean of x
0.8621053
In the result above :
t is the t-test statistic value (t = 43.797), df is the degrees of freedom (df= 18), p-value is the significance level of the t-test (p-value <0.05). conf.int is the confidence interval of the mean at 95% (conf.int = [0.820, 0.903]); sample estimates is he mean value of the sample (mean = 0.862).
The p-value of the test is smaller then 0.05.
After careful evaluation and performing one ttest on all perdictors, we can say that Liver weight has the lowest t value, therefore this one can be left out of the model.
#1e
step(q2, data=liver, direction = "backward")
Start: AIC=-93.78
Y ~ BdyWt + LvrWt + Dose
Df Sum of Sq RSS AIC
- LvrWt 1 0.004120 0.093729 -94.924
<none> 0.089609 -93.778
- BdyWt 1 0.042408 0.132017 -88.416
- Dose 1 0.044982 0.134591 -88.049
Step: AIC=-94.92
Y ~ BdyWt + Dose
Df Sum of Sq RSS AIC
<none> 0.093729 -94.924
- BdyWt 1 0.039851 0.133580 -90.192
- Dose 1 0.043929 0.137658 -89.621
Call:
lm(formula = Y ~ BdyWt + Dose, data = liver)
Coefficients:
(Intercept) BdyWt Dose
0.28552 -0.02044 4.12533
Here we have performed Backward elimination method to check the best small parsimonious model Using AIC approach.
We can see in the beginning AIC was 93.78 with all the three variables. After the next step, AIC increased to 94.42 and Liver Weight variable was removed. Therefore, After backward elimination, our Final model contains *body weight and Dose variables. The co-efficient of the final model is 0.285 for the intercept, -0.020 for Body weight and 4.125 for Dose variables.
crest <- read.csv("crest.csv")
crest
In the above step, we have imported Crest data frame and now we will perform our evaluation on it.
To check that, we have to make a model using lm function.
# lm model
crestlm <- lm(Crest.Sales.Y~Crest.Budget.X1+Ratio.X2+Us.Personal.disposable.income.x3, data=crest)
summary(crestlm)
Call:
lm(formula = Crest.Sales.Y ~ Crest.Budget.X1 + Ratio.X2 + Us.Personal.disposable.income.x3,
data = crest)
Residuals:
Min 1Q Median 3Q Max
-24088 -2568 1021 3836 10100
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34104.559 17654.144 1.932 0.082187 .
Crest.Budget.X1 3.746 1.976 1.896 0.087243 .
Ratio.X2 -30046.343 22859.674 -1.314 0.218066
Us.Personal.disposable.income.x3 85.926 17.911 4.797 0.000727 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9574 on 10 degrees of freedom
Multiple R-squared: 0.9691, Adjusted R-squared: 0.9598
F-statistic: 104.5 on 3 and 10 DF, p-value: 7.537e-08
From the above model, We can see the predictors values of every variable. From this we can say that, crest Budget X1, ratio X2 and Ratio X2 has P value of 0.082, 0.087 and 0.218 respectively. Therefore, we can say that these variables are insignificant as their P value is greater then 0.05. Whereas US personal disposable income X3 has p value of 0.0007 which is smaller then 0.05. Therefore, we can say that this is significant.
.
#anova
anova(crestlm)
Analysis of Variance Table
Response: Crest.Sales.Y
Df Sum Sq Mean Sq F value Pr(>F)
Crest.Budget.X1 1 2.5611e+10 2.5611e+10 279.392 1.23e-08 ***
Ratio.X2 1 1.0202e+09 1.0202e+09 11.130 0.0075401 **
Us.Personal.disposable.income.x3 1 2.1097e+09 2.1097e+09 23.015 0.0007265 ***
Residuals 10 9.1667e+08 9.1667e+07
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above Anova test, we can see that the P values of Crest Budget X1,ratio X2, and US personal Disposable incomes are smaller then 0.05. So we can say that all the values are significant and thus, all variables are important.
To check if any assumption is violated or not, we have to make visualize following plots and on the basis of them we will make our statement.
#assumptions
par(mfrow = c(2,2))
plot(crestlm)
In the above plot we can see 4 Diagnostic plots known as Residuals plot, QQ plots, Scale-location plot and leverage plots.
In the 1st graph we can see that on X axis has predicted value known as Y^(y hat) and on y axis there are residuals and errors. Here we can the line is not exactly flat but points are kind of falling near the line only and the red line is nearby the line only. Therefore, we can say that that linearity assumption is not violated here.
In the next Graph (QQ plot), here y axis is ordered, observed and Standardized residuals and on X axis it has ordered theoretical residuals. In the graph we can see that residuals are truly normally distributed as all the points are falling on a straight line.
The next two shows that the regression is non-linear, non-constant variance.
To test further, if any assumptions are violated or not. We will perform some tests.
durbinWatsonTest(crestlm)
lag Autocorrelation D-W Statistic p-value
1 -0.1129573 2.211283 0.926
Alternative hypothesis: rho != 0
From the Durbin Waston test, we found out that p value is 0.834. This indictes positive autocorrelation in the model. P value is grater then 0.05, we can say that the model is insignificant and we fail to reject the null hypothesis(Ha).
shapiro.test(crestlm$residuals)
Shapiro-Wilk normality test
data: crestlm$residuals
W = 0.83777, p-value = 0.01522
From the Shapiro-wilk normality test, after observing the P value of 0.015. We can say that the normality is not violated here as p vale is smaller then 0.05. Hence, we can assume the normality as the distribution of data are not significantly different from normal distribution.
ncvTest(crestlm)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.07424735, Df = 1, p = 0.78525
Here, we can see that p value is greater then 0.05. therefore, we can say that the constant variance assumptions is violated here as we fail to reject the null hypothesis.
Here, we will use VIF function to check for multi collinearity.
#vif
vif(crestlm)
Crest.Budget.X1 Ratio.X2 Us.Personal.disposable.income.x3
11.421674 2.865841 7.443232
From the above VIF Functions, we can say that Crest Budget X1 shows Multi collinearity as it has a value of 11.42 which is greater then 10. For Ratio X2 we can say that there is no Multi collinearity as 2.86 is less then 10. For US Personal Disposable Income X3 shows value of 7.44 which is less than 10 but greater then 5. So, we can say that there maybe a issue of collinearity.
Now in this step, we will build and compare diffrent models.
#full model shoul contains all the variables
full=lm(Crest.Sales.Y ~., data=crest)
To create a backward step model, we have created a variable Full which contain all the variables. In Backward step, we will use Full variable, which will eliminate predictors one by one.
# null model contains no variable
null=lm(Crest.Sales.Y ~1, data=crest)
To create a Forward variable, we have created a null variable which doesn’t contain any variable. We will use this null variable in every step and we will add on one by one predictors in Forward step model.
#Backward elimination using AIC values
step(crestlm, data=crest, direction="backward")
Start: AIC=259.96
Crest.Sales.Y ~ Crest.Budget.X1 + Ratio.X2 + Us.Personal.disposable.income.x3
Df Sum of Sq RSS AIC
<none> 916674966 259.96
- Ratio.X2 1 158364757 1075039723 260.19
- Crest.Budget.X1 1 329414202 1246089168 262.26
- Us.Personal.disposable.income.x3 1 2109684588 3026359554 274.68
Call:
lm(formula = Crest.Sales.Y ~ Crest.Budget.X1 + Ratio.X2 + Us.Personal.disposable.income.x3,
data = crest)
Coefficients:
(Intercept) Crest.Budget.X1 Ratio.X2
34104.559 3.746 -30046.343
Us.Personal.disposable.income.x3
85.926
Here we have performed Backward elimination method to check the best small parsimonious model Using AIC approach.
Here, we can see that the AIC was 259.96 with ll the three varibles present. In this step, we found our best model in the first step only with the lowest AIC.
The co-efficient of the final model is 34104.559 for the intercept, 3.746 for crest Budget X1, -30046.343 for Ratio X2 and 85.926 for US personal disposable income X3 variables.
Now, We will build a model using forward step.
#forward selection using AIC values
step(null, scope=list(lower=null, upper=crestlm), direction="forward")
Start: AIC=302.64
Crest.Sales.Y ~ 1
Df Sum of Sq RSS AIC
+ Us.Personal.disposable.income.x3 1 2.8411e+10 1.2467e+09 260.27
+ Crest.Budget.X1 1 2.5611e+10 4.0466e+09 276.75
+ Ratio.X2 1 1.0637e+10 1.9020e+10 298.42
<none> 2.9658e+10 302.63
Step: AIC=260.27
Crest.Sales.Y ~ Us.Personal.disposable.income.x3
Df Sum of Sq RSS AIC
+ Crest.Budget.X1 1 171624035 1075039723 260.19
<none> 1246663758 260.27
+ Ratio.X2 1 574590 1246089168 262.26
Step: AIC=260.19
Crest.Sales.Y ~ Us.Personal.disposable.income.x3 + Crest.Budget.X1
Df Sum of Sq RSS AIC
+ Ratio.X2 1 158364757 916674966 259.96
<none> 1075039723 260.19
Step: AIC=259.96
Crest.Sales.Y ~ Us.Personal.disposable.income.x3 + Crest.Budget.X1 +
Ratio.X2
Call:
lm(formula = Crest.Sales.Y ~ Us.Personal.disposable.income.x3 +
Crest.Budget.X1 + Ratio.X2, data = crest)
Coefficients:
(Intercept) Us.Personal.disposable.income.x3 Crest.Budget.X1
34104.559 85.926 3.746
Ratio.X2
-30046.343
In this Forward step model building, we will start with AIC of 302.64 and we will add Us Disposable income X3 at first and after adding our AIC becomes 260.27. In the second step, we will add Crest Budget X1 and after adding that out AIC becomes 260.19. At the last, we will add our 3rd and last predictor i.e. Ratio X2 and after that AIC becomes 259.96
So our Final Model has AIC of 259.96. The Co-efficients of our final model is: * Intercept: 34104.559 * Us Personal Disposable Income X3: 85.926 * Crest Budget X1: 3.746 Ratio X2: -30046.343
Now, we will build a model using Step wise regression
#stepwise regression using AIC values
step(null, scope = list(upper=crestlm), data=Housing, direction="both")
Start: AIC=302.64
Crest.Sales.Y ~ 1
Df Sum of Sq RSS AIC
+ Us.Personal.disposable.income.x3 1 2.8411e+10 1.2467e+09 260.27
+ Crest.Budget.X1 1 2.5611e+10 4.0466e+09 276.75
+ Ratio.X2 1 1.0637e+10 1.9020e+10 298.42
<none> 2.9658e+10 302.63
Step: AIC=260.27
Crest.Sales.Y ~ Us.Personal.disposable.income.x3
Df Sum of Sq RSS AIC
+ Crest.Budget.X1 1 1.7162e+08 1.0750e+09 260.19
<none> 1.2467e+09 260.27
+ Ratio.X2 1 5.7459e+05 1.2461e+09 262.26
- Us.Personal.disposable.income.x3 1 2.8411e+10 2.9658e+10 302.63
Step: AIC=260.19
Crest.Sales.Y ~ Us.Personal.disposable.income.x3 + Crest.Budget.X1
Df Sum of Sq RSS AIC
+ Ratio.X2 1 158364757 916674966 259.96
<none> 1075039723 260.19
- Crest.Budget.X1 1 171624035 1246663758 260.27
- Us.Personal.disposable.income.x3 1 2971530267 4046569990 276.75
Step: AIC=259.96
Crest.Sales.Y ~ Us.Personal.disposable.income.x3 + Crest.Budget.X1 +
Ratio.X2
Df Sum of Sq RSS AIC
<none> 916674966 259.96
- Ratio.X2 1 158364757 1075039723 260.19
- Crest.Budget.X1 1 329414202 1246089168 262.26
- Us.Personal.disposable.income.x3 1 2109684588 3026359554 274.68
Call:
lm(formula = Crest.Sales.Y ~ Us.Personal.disposable.income.x3 +
Crest.Budget.X1 + Ratio.X2, data = crest)
Coefficients:
(Intercept) Us.Personal.disposable.income.x3 Crest.Budget.X1
34104.559 85.926 3.746
Ratio.X2
-30046.343
In this Step-wise regression model, we will start with AIC of 302.64 and then add Crest sales Y in our model. After that, we will add US personal disposable income X3 and our AIC becomes 260.27. Similarly we will keep on adding and subtracting variables, until at the end we found we found our AIC at lowest. In the last step, our AIC is 259.96 .
The Co_efficient are: * Intercept: 34104.559 * US Personal Income X3: 85.926 * Crest Budget X1: 3.746 * Ratio X2: -30046.343