408 HW7 Blohm, Alex

Question 1 6.2

Part A

X = \(\begin{bmatrix} X_{11} & X_{12} & X_{11}^2\\ X_{21} & X_{22} & X_{21}^2\\ X_{31} & X_{32} & X_{31}^2\\ X_{41} & X_{42} & X_{41}^2\\ X_{51} & X_{52} & X_{51}^2\\ \end{bmatrix}\)

\(\beta\) = \(\begin{bmatrix} \beta _0\\ \beta _1\\ \beta _2\\ \end{bmatrix}\)

Part B

X = \(\begin{bmatrix} 1 & X_{11} & log_{10}X_{12}\\ 1 & X_{21} & log_{10}X_{22}\\ 1 & X_{31} & log_{10}X_{32}\\ 1 & X_{41} & log_{10}X_{42}\\ 1 & X_{51} & log_{10}X_{52}\\ \end{bmatrix}\)

\(\beta\) = \(\begin{bmatrix} \beta _0\\ \beta _1\\ \beta _2\\ \end{bmatrix}\)

Question 2

Part a. 6.5a.

Data <- read.table("http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%20%206%20Data%20Sets/CH06PR05.txt")
names(Data) = c("Brand_Liking", "Moisture","Sweetness")
n <- nrow(Data)  
plot(Data)

This plot shows a pretty good linear relationship between moisture and Brand Liking, but not much correlation between sweetness and brand liking. For the Brand Liking and Sweetness plot, I am guessing that Sweetness is categorical.

Part b. 6.5b.

fit_all<- lm(Brand_Liking~., data=Data)
summary(fit_all)

## 
## Call:
## lm(formula = Brand_Liking ~ ., data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## Moisture      4.4250     0.3011  14.695 1.78e-09 ***
## Sweetness     4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

names(fit_all)

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

fit_all$coefficients

## (Intercept)    Moisture   Sweetness 
##      37.650       4.425       4.375

\(\widehat{BrandLiking} = 37.65 + 4.425Moisture + 4.375Sweetness\)

or better written as \(\hat{Y} = 37.65 + 4.425X_1 + 4.375X_2\)

We interpret b1 as: Mean Brand Liking increases by 4.425 as Moisture increases by 1 unit and Sweetness stays constant.

Part c.

boxplot(fit_all$residuals, horizontal = TRUE)

This boxplot shows an approximately symmetric distribution of residuals with no outliers. We can assume that the residuals are normally distributed.

Part d

plot(fit_all$fitted.values,fit_all$residuals)

plot(fit_all,which=1)

This shows a slight arc shape pattern and suggests this may not be the best model. But the data looks to have a relatively constant variance (homoscedastic).

x1=Data[,2] #moisture content
x2=Data[,3] #Sweetness
y=Data[,1] #Brand Liking
plot(x1, fit_all$residuals, title("Moisture vs. residuals"))

This shows a slight arc pattern, but a decent scatter and approximately homoscedastic.

plot(x2, fit_all$residuals, title("Sweetness vs. residuals"))

This shows that all sweetness responses were 2 or 4 and therefore it is not equally spread.

plot(x1*x2, fit_all$residuals, title("Sweetness/Moisture Interaction vs. residuals"))

This shows a decent spread and constant variance. There might be slight arc shape (not larger residuals between 10 and 15).

plot(fit_all,which=2)

The normal probability plot shows a relatively normal distribution of residuals with a few outliers (14, 15, 11).

Part e. 6.6a.

\[H_0: \beta_1 = \beta_2 = 0 \]

\[H_a: At~least~1~\beta_k \neq 0~for~k = 1, 2 \]

summary(fit_all)

## 
## Call:
## lm(formula = Brand_Liking ~ ., data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## Moisture      4.4250     0.3011  14.695 1.78e-09 ***
## Sweetness     4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

The F-Statistic is 129.1 and the p-value is almost 0 which is < our \(\alpha\) = 0.01. Therefore we reject the null and conclude that at least 1 \(\beta_k \neq 0\) for k = {1, 2}. This implies that \(\beta_1\) or \(\beta_2\) is significant (Moisture or Sweetness or both have a relationship with Brand Liking).

Part f. 6.6b.

Our p-value is \(2.658 * 10^{-9}\) which is almost 0.

Part g. 6.7a.

\(R^2\) = 0.9521 (the adjusted \(R^2\) = 0.9447). This means that there is a strong linear correlation between Brand Liking and the predictors moisture content and sweetness.

Matrix Method

This section is devoted to running the data again with Matrix Methods. Not meant to be graded.

Data <- read.table("http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%20%206%20Data%20Sets/CH06PR05.txt")
names(Data) = c("Brand_Liking", "Moisture","Sweetness")
n <- nrow(Data)  
plot(Data)  #Scatter plot matrix

cov(Data)  #Var-Cov Matrix

##              Brand_Liking  Moisture Sweetness
## Brand_Liking   131.133333 23.600000  4.666667
## Moisture        23.600000  5.333333  0.000000
## Sweetness        4.666667  0.000000  1.066667

cor(Data)

##              Brand_Liking  Moisture Sweetness
## Brand_Liking    1.0000000 0.8923929 0.3945807
## Moisture        0.8923929 1.0000000 0.0000000
## Sweetness       0.3945807 0.0000000 1.0000000

X <- as.matrix(cbind(rep(1,n),Data[,2:3]))  #Creating the X Matrix
p <- ncol(X)
colnames(X) = c("intercept", "Moisture","Sweetness")
Y <- as.matrix(Data[,1])  
I=diag(n)
b <- solve(t(X)%*%X)%*%t(X)%*%Y; b  #LSE b

##             [,1]
## intercept 37.650
## Moisture   4.425
## Sweetness  4.375

H <- X%*%solve(t(X)%*%X)%*%t(X)     #Hat matrix
hatY <- H%*%Y                       #Fitted 
SSE <- t(Y)%*%(I-H)%*%Y
MSE <- as.numeric(SSE/(n-p)); sqrt(MSE)

## [1] 2.693297

sb_all <- sqrt(MSE*diag(solve(t(X)%*%X)))
sb_all  # estimated sd for all beta

## intercept  Moisture Sweetness 
## 2.9961032 0.3011197 0.6733241

sb1 <- sb_all[2]   # The second one is the estimate sd for beta1
CI_b1 <- c(b[2]-qt(.975,n-p)*sb1,b[2]+qt(.975,n-p)*sb1)
CI_b1   # CI for beta1

## Moisture Moisture 
##  3.77447  5.07553

CI_b_all <- matrix(c(b-qt(.975,n-p)*sb_all,b+qt(.975,n-p)*sb_all),ncol=2)
CI_b_all   # CI for all beta

##           [,1]      [,2]
## [1,] 31.177312 44.122688
## [2,]  3.774470  5.075530
## [3,]  2.920372  5.829628

t=b[2]/sb1 # t-stat
p_val1 <- 2*(1-pt(abs(t),n-p)); p_val1    #Fail to reject H0: beta1=0

##     Moisture 
## 1.778029e-09

t_all = b/sb_all
p_val_all <- 2*(1-pt(abs(t_all),n-p)); p_val_all

##                   [,1]
## intercept 1.199637e-08
## Moisture  1.778029e-09
## Sweetness 2.011047e-05

Xh <- matrix(c(1,30,110,90),nrow=4) Yh <- t(Xh)%%b smean <- sqrt(MSE(t(Xh)%%solve(t(X)%%X)%%Xh)) CI_mean <- c(Yh - qt(0.975, n-p) smean, Yh + qt(0.975, n-p) * smean) #CI mean response CI_mean

spred <- sqrt(smean^2+MSE) PI <- c(Yh-qt(0.975, n-p) * spred, Yh+qt(0.975, n-p) * spred) # PI PI

J=rep(1,n)%%t(rep(1,n)) SSTO <- t(Y)%%(I-J/n)%%Y SSR <- t(Y)%%(H-J/n)%*%Y SSTO-SSE

F_val <- SSR/(p-1)/MSE ; F_val p_val1 <- 1-pf(F_val,p-1,n-p); p_val1

Rsq <- 1-SSE/SSTO; Rsq AdRsq <- 1- (n-1)/(n-4)*SSE/SSTO; AdRsq