Problem 2 - KNNL 6.5-6.7

Part A. 6.5 a.

From the diagnostics, we can see that Y and X1 have a correlation of 0.892 while Y and X2 have a correlation of 0.395. This implies that X1 may be a better predictor for Y.

ch6 <-read.table("CH06PR05.txt")
head(ch6)

##   V1 V2 V3
## 1 64  4  2
## 2 73  4  4
## 3 61  4  2
## 4 76  4  4
## 5 72  6  2
## 6 80  6  4

names(ch6)

## [1] "V1" "V2" "V3"

names(ch6) <- c("Y", "X1", "X2")
library(GGally)

## Loading required package: ggplot2

ggpairs(ch6, columns = 1:3)

pairs(~Y+X1+X2,data=ch6, 
   main="Simple Scatterplot Matrix")

corr_matrix <- cor(ch6)
corr_matrix

##            Y        X1        X2
## Y  1.0000000 0.8923929 0.3945807
## X1 0.8923929 1.0000000 0.0000000
## X2 0.3945807 0.0000000 1.0000000

library(ggcorrplot)
ggcorrplot(corr_matrix, hc.order = TRUE, type = "lower",
   lab = TRUE)

Part B- 6.5 b.

The estimated model is as follows

\[ Y_i = 37.650 + 4.4250X_1+4.3750X_2 \] In this case, \(b_1\) represents the variable \(X_1\) which is the moisture content. Keeping all other variables constant, when X1, the moisture, increases by 1 unit, the predicted brand liking increases by 1 unit.

fit <- lm(Y~X1+X2,data=ch6)
summary(fit)

## 
## Call:
## lm(formula = Y ~ X1 + X2, data = ch6)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## X1            4.4250     0.3011  14.695 1.78e-09 ***
## X2            4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

Part C - Residuals

The residuals against X2 = Sweetness do seem to have a pattern and are not randomly scattered. All the values for X2 are either 2 or 4 which is why the residuals look like this.

Based on the other residual plots, we can assume that the linearity assumption is met. The residuals are normally distributed and uncorrelated with any of the predictors.

Thus, the model assumptions hold.

boxplot(fit$residuals)

#Residuals against Y_hat = Brand Liking
library(ggplot2)
qplot(x=fit$residuals,
      y=fit$model$Y,
      xlab="Residuals",
      ylab="Brand Liking")

#Residuals against X1 = Moisture Content
qplot(x=fit$residuals,
      y=fit$model$X1,
      xlab="Residuals",
      ylab="Moisture Content")

#Residuals againt X2 = Sweetness
qplot(x=fit$residuals,
      y=fit$model$X2,
      xlab="Residuals",
      ylab="Sweetness")

#Residuals against X1*X2
qplot(x=fit$residuals,
      y=fit$model$X1*fit$model$X2,
      xlab="Residuals",ylab="Moisture Content times Sweetness")

Part D. QQ Plot and Normality

After viewing the QQ plot and conducting the Shapiro Wilk normality test, we fail to reject the \(H_0\) that the data is normal based on the p-value of 0.922 > \(\alpha =0.05\) an the relatively straight qq plot.

plot(fit)

shapiro.test(fit$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit$residuals
## W = 0.97585, p-value = 0.9222

Part E. 6.6 a.

Hypothesis Test:

\(H_0: \beta_{1}= \beta_{2} = \dots \beta{p-1} = 0\)

\(H_A:\) Atleast one \(\beta_{1}, \beta_{2}, \dots, \beta{p-1} \neq 0\)

Decision Rule:

If \(F* > F\), then reject \(H_0\).

Conclusion:

From our ANOVA table, we know that \(F*=129.1\) Below we calculated F=8.8615. Thus, since the values of \(F* > F\), we reject the null hypothesis that \(H_0: \beta_{1}= \beta_{2} = \dots \beta{p-1} = 0\). Additionally, the p-value below from the summary output is below \(\alpha = 0.01\)

This test implies that atleast one of B1 and B2 are not equal to 0 and there exists a linear relationship.

anova(fit)

## Analysis of Variance Table
## 
## Response: Y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## X1         1 1566.45 1566.45 215.947 1.778e-09 ***
## X2         1  306.25  306.25  42.219 2.011e-05 ***
## Residuals 13   94.30    7.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(fit)

## 
## Call:
## lm(formula = Y ~ X1 + X2, data = ch6)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## X1            4.4250     0.3011  14.695 1.78e-09 ***
## X2            4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

a <- 0.01
n <- nrow(ch6)

F_test <- qf((1-a),1,n-2)
F_test

## [1] 8.861593

Part F

The p-value of as listed in the summary statistics is 2.658e-09.

Part G

The \(R^2\) value is 0.952 from the summary statistics of the model.

STAT 408 HW 7

Kajal Chokshi

10/28/2018

Problem 1 - KNNL 6.2

Part A.

Part B.