\(X\) = \(\begin{bmatrix} X_{11} & X_{12} & {X}_{11}^2 \\ X_{21} & X_{22} & {X}_{21}^2 \\ X_{31} & X_{32} & {X}_{31}^2 \\ X_{41} & X_{42} & {X}_{41}^2 \\ X_{51} & X_{52} & {X}_{51}^2 \\ \end{bmatrix}\)
\(\beta\) = \(\begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \end{bmatrix}\)
\(X\) = \(\begin{bmatrix} 1 & X_{11} & {log}_{10}X_{12} \\ 1 & X_{21} & {log}_{10}X_{22} \\ 1 & X_{31} & {log}_{10}X_{32} \\ 1 & X_{41} & {log}_{10}X_{42} \\ 1 & X_{51} & {log}_{10}X_{52} \\ \end{bmatrix}\)
\(\beta\) = \(\begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_{2} \\ \end{bmatrix}\)
From the diagnostics, we can see that Y and X1 have a correlation of 0.892 while Y and X2 have a correlation of 0.395. This implies that X1 may be a better predictor for Y.
ch6 <-read.table("CH06PR05.txt")
head(ch6)
## V1 V2 V3
## 1 64 4 2
## 2 73 4 4
## 3 61 4 2
## 4 76 4 4
## 5 72 6 2
## 6 80 6 4
names(ch6)
## [1] "V1" "V2" "V3"
names(ch6) <- c("Y", "X1", "X2")
library(GGally)
## Loading required package: ggplot2
ggpairs(ch6, columns = 1:3)
pairs(~Y+X1+X2,data=ch6,
main="Simple Scatterplot Matrix")
corr_matrix <- cor(ch6)
corr_matrix
## Y X1 X2
## Y 1.0000000 0.8923929 0.3945807
## X1 0.8923929 1.0000000 0.0000000
## X2 0.3945807 0.0000000 1.0000000
library(ggcorrplot)
ggcorrplot(corr_matrix, hc.order = TRUE, type = "lower",
lab = TRUE)
The estimated model is as follows
\[ Y_i = 37.650 + 4.4250X_1+4.3750X_2 \] In this case, \(b_1\) represents the variable \(X_1\) which is the moisture content. Keeping all other variables constant, when X1, the moisture, increases by 1 unit, the predicted brand liking increases by 1 unit.
fit <- lm(Y~X1+X2,data=ch6)
summary(fit)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = ch6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.400 -1.762 0.025 1.587 4.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.6500 2.9961 12.566 1.20e-08 ***
## X1 4.4250 0.3011 14.695 1.78e-09 ***
## X2 4.3750 0.6733 6.498 2.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9447
## F-statistic: 129.1 on 2 and 13 DF, p-value: 2.658e-09
The residuals against X2 = Sweetness do seem to have a pattern and are not randomly scattered. All the values for X2 are either 2 or 4 which is why the residuals look like this.
Based on the other residual plots, we can assume that the linearity assumption is met. The residuals are normally distributed and uncorrelated with any of the predictors.
Thus, the model assumptions hold.
boxplot(fit$residuals)
#Residuals against Y_hat = Brand Liking
library(ggplot2)
qplot(x=fit$residuals,
y=fit$model$Y,
xlab="Residuals",
ylab="Brand Liking")
#Residuals against X1 = Moisture Content
qplot(x=fit$residuals,
y=fit$model$X1,
xlab="Residuals",
ylab="Moisture Content")
#Residuals againt X2 = Sweetness
qplot(x=fit$residuals,
y=fit$model$X2,
xlab="Residuals",
ylab="Sweetness")
#Residuals against X1*X2
qplot(x=fit$residuals,
y=fit$model$X1*fit$model$X2,
xlab="Residuals",ylab="Moisture Content times Sweetness")
After viewing the QQ plot and conducting the Shapiro Wilk normality test, we fail to reject the \(H_0\) that the data is normal based on the p-value of 0.922 > \(\alpha =0.05\) an the relatively straight qq plot.
plot(fit)
shapiro.test(fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit$residuals
## W = 0.97585, p-value = 0.9222
Hypothesis Test:
\(H_0: \beta_{1}= \beta_{2} = \dots \beta{p-1} = 0\)
\(H_A:\) Atleast one \(\beta_{1}, \beta_{2}, \dots, \beta{p-1} \neq 0\)
Decision Rule:
If \(F* > F\), then reject \(H_0\).
Conclusion:
From our ANOVA table, we know that \(F*=129.1\) Below we calculated F=8.8615. Thus, since the values of \(F* > F\), we reject the null hypothesis that \(H_0: \beta_{1}= \beta_{2} = \dots \beta{p-1} = 0\). Additionally, the p-value below from the summary output is below \(\alpha = 0.01\)
This test implies that atleast one of B1 and B2 are not equal to 0 and there exists a linear relationship.
anova(fit)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X1 1 1566.45 1566.45 215.947 1.778e-09 ***
## X2 1 306.25 306.25 42.219 2.011e-05 ***
## Residuals 13 94.30 7.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = ch6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.400 -1.762 0.025 1.587 4.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.6500 2.9961 12.566 1.20e-08 ***
## X1 4.4250 0.3011 14.695 1.78e-09 ***
## X2 4.3750 0.6733 6.498 2.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9447
## F-statistic: 129.1 on 2 and 13 DF, p-value: 2.658e-09
a <- 0.01
n <- nrow(ch6)
F_test <- qf((1-a),1,n-2)
F_test
## [1] 8.861593
The p-value of as listed in the summary statistics is 2.658e-09.
The \(R^2\) value is 0.952 from the summary statistics of the model.