Problem 1

#Read data
p088_data <- read.table('P088.txt',header = TRUE)
p088_data <- p088_data[,-c(1)]
head(p088_data)
##    Age   HS Income Black Female Price Sales
## 1 27.0 41.3   2948  26.2   51.7  42.7  89.8
## 2 22.9 66.7   4644   3.0   45.7  41.8 121.3
## 3 26.3 58.1   3665   3.0   50.8  38.5 115.2
## 4 29.1 39.9   2878  18.3   51.5  38.8 100.3
## 5 28.1 62.6   4493   7.0   50.8  39.7 123.0
## 6 26.2 63.9   3855   3.0   50.7  31.1 124.8

a)

To test whether the variable \(\text{Female}\) is needed in the regression equation relating \(\text{Sales}\) to the six predictor variables, we set up the hypotheses as follows:

\[H_0: \beta_{Female}=0~~~; H_a: \beta_{Female} \ne 0\]

#Build the linear regression model
model<- lm(Sales~., data= p088_data)
summary(model)
## 
## Call:
## lm(formula = Sales ~ ., data = p088_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.398 -12.388  -5.367   6.270 133.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 103.34485  245.60719   0.421  0.67597   
## Age           4.52045    3.21977   1.404  0.16735   
## HS           -0.06159    0.81468  -0.076  0.94008   
## Income        0.01895    0.01022   1.855  0.07036 . 
## Black         0.35754    0.48722   0.734  0.46695   
## Female       -1.05286    5.56101  -0.189  0.85071   
## Price        -3.25492    1.03141  -3.156  0.00289 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.2282 
## F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857

From the output, we have \(t=\hat{\beta}_{Female}/SE(\hat{\beta}_{Female})=-0.189\) with \(\text{p-value}=0.851>0.05\). Therefore, we fail to reject \(H_0\) and conclude that, once the other variables are in the model, \(\text{Female}\) is no longer needed.
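As a quick check, the same t-statistic and p-value can be read directly from the coefficient matrix of the summary object (a sketch using base R):

#Extract the Female row: estimate, SE, t value, p-value
summary(model)$coefficients["Female", ]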

b)

To test the hypothesis that the variables \(\text{Female}\) and \(\text{HS}\) are not needed in the above regression equation, we set up the hypotheses:

\[H_0: \beta_{Female}=\beta_{HS}=0~~~; H_a: \text{At least one is different from 0}\]

#Reduced model without Female and HS variable
model2 <- lm(Sales~.-Female-HS,data= p088_data)

anova(model2,model)
## Analysis of Variance Table
## 
## Model 1: Sales ~ (Age + HS + Income + Black + Female + Price) - Female - 
##     HS
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     46 34960                           
## 2     44 34926  2    33.799 0.0213 0.9789

To test this hypothesis, we first fit a reduced model without the variables \(\text{Female}\) and \(\text{HS}\). Comparing it with the full model via the \(\text{anova()}\) function, we obtain \(\text{F-statistic}=0.0213\) and \(\text{p-value}=0.9789\). Therefore, we do not have enough evidence to reject \(H_0\), which implies that \(\text{Female}\) and \(\text{HS}\) are not statistically significant in predicting \(\text{Sales}\) once the other variables are in the model.
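The partial F-statistic can also be reproduced by hand from the two residual sums of squares (a sketch using the rounded RSS values printed above, so it matches the table only approximately):

#Partial F = [(RSS_reduced - RSS_full)/df_diff] / [RSS_full/df_full]
rss_reduced <- 34960; rss_full <- 34926
f_stat <- ((rss_reduced - rss_full)/2)/(rss_full/44)
f_stat                                  #~0.021
pf(f_stat, 2, 44, lower.tail = FALSE)   #~0.98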

c) 95% confidence interval for the true regression coefficient of the variable Income.

We have \(\hat{\beta}_{Income}= 0.01895\) and \(\text{SE}(\hat{\beta}_{Income})= 0.01022\).
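The interval is constructed from the t distribution with the model's residual degrees of freedom (44 here):

\[\hat{\beta}_{Income} \pm t_{0.975,\,44}\, SE(\hat{\beta}_{Income})\]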

b_income <- 0.01895
se_income <- 0.01022  
#Construct the upper bound 
UB <- b_income + qt(0.025,model$df.residual,lower.tail = FALSE)*se_income
#Construct the lower bound
LB <- b_income - qt(0.025,model$df.residual,lower.tail = FALSE)*se_income

conf_int_income <- c(LB,UB)

cat("The 95% CI for the true regression coefficient of the variable Income is: ", conf_int_income)
## The 95% CI for the true regression coefficient of the variable Income is:  -0.001647057 0.03954706
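Equivalently, base R's \(\text{confint()}\) computes the interval from the unrounded estimates, so its endpoints may differ slightly from the hand computation above:

#Built-in confidence interval for the Income coefficient
confint(model, "Income", level = 0.95)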

Problem 2

data_2 <- read.table("CH06PR05.txt")
colnames(data_2)<-c("brand_liking",'moisture','sweetness')
head(data_2)
##   brand_liking moisture sweetness
## 1           64        4         2
## 2           73        4         4
## 3           61        4         2
## 4           76        4         4
## 5           72        6         2
## 6           80        6         4

a) Relationship between Y and X1, X2

pairs(data_2)

From the scatter plot matrix above, there appears to be a positive correlation between \(\text{moisture content}\) and \(\text{brand liking}\). The relationship between \(\text{sweetness}\) and \(\text{brand liking}\) is less clear and may be non-linear.
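As a quick numeric companion to the plot (a sketch using base R's \(\text{cor()}\)), the pairwise correlation matrix quantifies these visual impressions:

#Pairwise correlations between brand liking and the two predictors
cor(data_2)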

b)

#Fit the model
problem2_model <- lm(brand_liking ~.,data= data_2)

# Summary of the model
summary(problem2_model)
## 
## Call:
## lm(formula = brand_liking ~ ., data = data_2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## moisture      4.4250     0.3011  14.695 1.78e-09 ***
## sweetness     4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

From the summary table, we obtain the fitted model as follows:

\[\hat{Y}= 37.65 + 4.425\,X_{\text{moisture}}+ 4.375\,X_{\text{sweetness}}\]

We can also see that both \(\text{Moisture}\) and \(\text{Sweetness}\) variables are statistically significant based on \(\text{p-value}<0.05\).

anova(problem2_model)
## Analysis of Variance Table
## 
## Response: brand_liking
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## moisture   1 1566.45 1566.45 215.947 1.778e-09 ***
## sweetness  1  306.25  306.25  42.219 2.011e-05 ***
## Residuals 13   94.30    7.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can obtain from the ANOVA table that \(\text{SSE}=94.30\) with \(\text{df=13}\) and \(\text{MSE}=7.25\).

From the ANOVA table above, when only \(\text{Moisture}\) is in the model, we have \(SSR(\text{Moisture})=1566.45\), \(\text{F-value}=215.947\), and a significant \(\text{p-value}\), which shows that \(\text{Moisture}\) has a significant effect on the response variable. Next, when we add \(\text{Sweetness}\) to the model, we have \(SSR(\text{Sweetness}|\text{Moisture})=306.25\) and \(\text{p-value}<0.05\). Hence, \(\beta_2\) is significantly different from 0 given \(\beta_1\).

Therefore, for the hypothesis \(H_0: \beta_1=\beta_2=0\), we reject \(H_0\), which implies that at least one of the two coefficients is different from 0.
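As a hand-check (a sketch using the rounded sums of squares from the ANOVA table, so it matches the summary's F-statistic of 129.1 only approximately), the overall F-statistic is the combined regression mean square divided by the MSE:

#Overall F = [SSR(Moisture) + SSR(Sweetness|Moisture)]/2 / MSE
((1566.45 + 306.25)/2)/7.25   #~129.2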

From the linear regression output table, we have

\(\hat{\beta}_1=4.425\). This shows that for every unit increase in moisture content, holding sweetness fixed, the mean degree of brand liking increases by 4.425. We have \(\text{p-value} < 0.05\); therefore, we have enough evidence to reject \(H_0: \beta_1=0\).

\(\hat{\beta}_2=4.375\). This shows that for every unit increase in sweetness, holding moisture fixed, the mean degree of brand liking increases by 4.375. We also have \(\text{p-value} < 0.05\); therefore, we reject the hypothesis \(H_0: \beta_2=0\).

Based on the ANOVA table above, both sequential tests are significant, so neither variable should be dropped: \(\text{Sweetness}\) still contributes significantly after \(\text{Moisture}\) is in the model. The same check can be run in the reverse order, entering \(\text{Sweetness}\) first and \(\text{Moisture}\) second, as we do in part g).

c)

par(mfrow=c(2,2))
#Plot ei and fitted Yi
residual_ei <- resid(problem2_model)
plot(fitted(problem2_model),residual_ei,xlab="Fitted Yi",ylab="Residual",main='residual plots of ei and fitted Yi')
#Plot ei and X1

plot(data_2$moisture,residual_ei,xlab="X1",ylab="Residual",main='residual plots of ei and X1')
#Plot ei and X2

plot(data_2$sweetness,residual_ei,xlab="X2",ylab="Residual",main='residual plots of ei and X2')

par(mfrow=c(1,2))
#Histogram plot of the Residuals
hist(residual_ei)

#Normality Probability plot of Residuals
qqnorm(residual_ei)
qqline(residual_ei)

From these plots, we conclude that the residuals appear approximately normally distributed, and the residual plots show no obvious pattern against the fitted values or either predictor.
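For a formal complement to the visual check, we can run a Shapiro-Wilk test on the residuals (a sketch using base R's \(\text{shapiro.test()}\)):

#Formal normality test; a large p-value is consistent with normal residuals
shapiro.test(residual_ei)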

d)

From the output, we have \(R^2= 0.9521\) and \(\text{Adjusted} ~R^2= 0.9447\).

e)

predict(problem2_model, newdata = list(moisture=5,sweetness=4), interval = 'confidence',level=0.95)
##      fit      lwr      upr
## 1 77.275 74.84094 79.70906

Given new \(X_{moisture}=5\) and \(X_{sweetness}=4\), the 95% confidence interval for \(E(Y_{new})\) is (74.84094, 79.70906).

f)

predict(problem2_model, newdata = list(moisture=5,sweetness=4), interval = 'prediction',level=0.95)
##      fit      lwr      upr
## 1 77.275 70.96788 83.58212

Given new \(X_{moisture}=5\) and \(X_{sweetness}=4\), the 95% prediction interval for a new observation \(Y_{new}\) is (70.96788, 83.58212). It is wider than the confidence interval in part e) because it also accounts for the variability of a single new observation.

g)

To obtain \(SSR(X_1|X_2)\), we can either compute \(SSE(X_2)-SSE(X_1,X_2)\) directly (sketched below) or read it from the ANOVA table. Since the Type I ANOVA table is sequential, we can reorder the two variables to obtain \(SSR(X_{moisture}|X_{sweetness})=1566.45\). This extra sum of squares measures the marginal effect of adding \(X_1\) to the regression model when \(X_2\) is already in the model.
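The subtraction route can be sketched with \(\text{deviance()}\), which returns the residual sum of squares of an \(\text{lm}\) fit:

#SSR(X1|X2) = SSE(X2) - SSE(X1, X2)
sse_x2   <- deviance(lm(brand_liking ~ sweetness, data = data_2))  #SSE(X2)
sse_full <- deviance(problem2_model)                               #SSE(X1, X2)
sse_x2 - sse_full   #1566.45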

problem2_model2 <- lm(brand_liking~sweetness + moisture,data=data_2)
anova(problem2_model2)
## Analysis of Variance Table
## 
## Response: brand_liking
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## sweetness  1  306.25  306.25  42.219 2.011e-05 ***
## moisture   1 1566.45 1566.45 215.947 1.778e-09 ***
## Residuals 13   94.30    7.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

h)

anova(problem2_model)
## Analysis of Variance Table
## 
## Response: brand_liking
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## moisture   1 1566.45 1566.45 215.947 1.778e-09 ***
## sweetness  1  306.25  306.25  42.219 2.011e-05 ***
## Residuals 13   94.30    7.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table above, we have \(SSR(X_2|X_1)=306.25\) and \(SSR(X_1)=1566.45\), so \(SSR(X_1,X_2)=SSR(X_1)+SSR(X_2|X_1)=1872.70\).

With \(SSE=94.30\), the total sum of squares is \(SSTO = SSR(X_1,X_2)+SSE = 1967.00\), and indeed \(R^2 = SSR(X_1,X_2)/SSTO = 0.9521\), matching the summary output.
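A sketch verifying \(SSTO\) directly from the data:

#Total sum of squares of the response
ssto <- sum((data_2$brand_liking - mean(data_2$brand_liking))^2)
ssto   #should equal SSR(X1, X2) + SSE = 1967.00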

i)

\[R^2_{Y1|2}=\frac{SSR(X_1|X_2)}{SSE(X_2)}\]

sweetness_model <- lm(brand_liking~sweetness,data=data_2)
anova(sweetness_model)
## Analysis of Variance Table
## 
## Response: brand_liking
##           Df  Sum Sq Mean Sq F value Pr(>F)
## sweetness  1  306.25  306.25  2.5817 0.1304
## Residuals 14 1660.75  118.62

We already have \(SSR(X_1|X_2)=1566.45\) and \(SSE(X_2)=1660.75\). Then \(R^2_{Y1|2}=1566.45/1660.75=0.9432\) measures the proportionate reduction in the variation in \(Y\) remaining after \(X_2\) is included in the model that is gained by also including \(X_1\) in the model.
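The same quantity can be sketched in code from the two fits already created:

#Partial R^2 of moisture given sweetness: SSR(X1|X2) / SSE(X2)
(deviance(sweetness_model) - deviance(problem2_model)) / deviance(sweetness_model)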

Problem 3

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

a)

problem3_model <- lm(Sepal.Length ~.-Species,data=iris)
summary(problem3_model)
## 
## Call:
## lm(formula = Sepal.Length ~ . - Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82816 -0.21989  0.01875  0.19709  0.84570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
## Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
## Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
## Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared:  0.8586, Adjusted R-squared:  0.8557 
## F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16

We have \(\hat{Y}_i=1.85600 + 0.65084\,x_1+0.70913\,x_2-0.55648\,x_3\), where \(x_1\), \(x_2\), and \(x_3\) are \(\text{Sepal.Width}\), \(\text{Petal.Length}\), and \(\text{Petal.Width}\), respectively.

b) Test the full model against the reduced (intercept-only) model

\[H_0: \beta_1=\beta_2=\beta_3=0~~~; H_A: \text{At least one is different from 0}\]

reduced_model3 <- lm(Sepal.Length ~1,data=iris)
anova(reduced_model3,problem3_model)
## Analysis of Variance Table
## 
## Model 1: Sepal.Length ~ 1
## Model 2: Sepal.Length ~ (Sepal.Width + Petal.Length + Petal.Width + Species) - 
##     Species
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    149 102.168                                  
## 2    146  14.445  3    87.723 295.54 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we have \(\text{F-statistic}=295.54\) and \(\text{p-value} < 0.05\). Therefore, we have enough evidence to reject \(H_0\) and conclude that at least one predictor has a significant effect on \(\text{sepal length}\).

c)

\[H_0: \beta_1-\beta_2=0; ~~~; H_A: \beta_1-\beta_2 \ne 0\]

We need the distribution of \(\hat{\beta}_1-\hat{\beta}_2\). We have \(E(\hat{\beta}_1-\hat{\beta}_2)=\beta_1-\beta_2\) and \(Var(\hat{\beta}_1-\hat{\beta}_2)=Var(\hat{\beta}_1)+Var(\hat{\beta}_2)-2\,Cov(\hat{\beta}_1,\hat{\beta}_2)\), where the variances and covariance come from \(MSE\,(X^TX)^{-1}\).

#Residual standard error = 0.3145 (rounded from the summary output;
#sigma(problem3_model) would give the unrounded value)
mse <- 0.3145^2


#Design matrix with an intercept column
X_matrix <- as.matrix(cbind(1, iris[,-c(1,5)]))

#Estimated covariance matrix of the coefficients: MSE * (X'X)^{-1}
cov_beta <- mse * solve(t(X_matrix) %*% X_matrix)

#Point estimate of beta1 - beta2 (Sepal.Width minus Petal.Length)
expectation <- problem3_model$coefficients[2] - problem3_model$coefficients[3]

#Var(b1 - b2) = Var(b1) + Var(b2) - 2Cov(b1, b2)
estimated_variance <- cov_beta[2,2] + cov_beta[3,3] - 2*cov_beta[2,3]

t_test <- expectation/sqrt(estimated_variance)

#Two-sided p-value, written with -abs() so it is correct whatever the sign of t
p_val <- 2*pt(-abs(t_test), 146)

cat("p-value for the hypothesis is:", p_val)
## p-value for the hypothesis is: 0.4277371

Since the \(\text{p-value} > 0.05\), we fail to reject \(H_0\) and conclude that there is no significant difference between \(\beta_1\) and \(\beta_2\).
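As a cross-check, base R's \(\text{vcov()}\) returns the coefficient covariance matrix directly (a sketch; it uses the exact MSE rather than the rounded 0.3145, so the p-value may differ slightly from ours):

#Built-in covariance matrix of the coefficient estimates
cb <- vcov(problem3_model)
t_check <- (coef(problem3_model)["Sepal.Width"] - coef(problem3_model)["Petal.Length"])/
  sqrt(cb["Sepal.Width","Sepal.Width"] + cb["Petal.Length","Petal.Length"] -
         2*cb["Sepal.Width","Petal.Length"])
2*pt(-abs(t_check), df.residual(problem3_model))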

d)

\[H_0: \beta_1-\beta_2=0; ~~~; H_A: \beta_1-\beta_2 < 0\]

p_val <- pt(t_test,146,lower.tail =TRUE)

cat("p-value for the hypothesis is:", p_val)
## p-value for the hypothesis is: 0.2138686

Since the \(\text{p-value} > 0.05\), we fail to reject \(H_0\); there is no evidence that \(\beta_1 < \beta_2\).