Problem 1

#Read data
p088_data <- read.table('P088.txt',header = TRUE)
p088_data <- p088_data[,-c(1)]
head(p088_data)
##    Age   HS Income Black Female Price Sales
## 1 27.0 41.3   2948  26.2   51.7  42.7  89.8
## 2 22.9 66.7   4644   3.0   45.7  41.8 121.3
## 3 26.3 58.1   3665   3.0   50.8  38.5 115.2
## 4 29.1 39.9   2878  18.3   51.5  38.8 100.3
## 5 28.1 62.6   4493   7.0   50.8  39.7 123.0
## 6 26.2 63.9   3855   3.0   50.7  31.1 124.8

a)

To test whether the variable \(\text{Female}\) is needed in the regression equation relating \(\text{Sales}\) to the six predictor variables, we set up the hypotheses as follows:

\[H_0: \beta_{Female}=0~~~; H_a: \beta_{Female} \ne 0\]

#Build the linear regression model
model<- lm(Sales~., data= p088_data)
summary(model)
## 
## Call:
## lm(formula = Sales ~ ., data = p088_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.398 -12.388  -5.367   6.270 133.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 103.34485  245.60719   0.421  0.67597   
## Age           4.52045    3.21977   1.404  0.16735   
## HS           -0.06159    0.81468  -0.076  0.94008   
## Income        0.01895    0.01022   1.855  0.07036 . 
## Black         0.35754    0.48722   0.734  0.46695   
## Female       -1.05286    5.56101  -0.189  0.85071   
## Price        -3.25492    1.03141  -3.156  0.00289 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.2282 
## F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857

From the output, we have \(t=\hat{\beta}_{Female}/SE(\hat{\beta}_{Female})=-0.189\) with \(\text{p-value}=0.851>0.05\). Therefore, we fail to reject \(H_0\) and conclude that, once the other variables are in the model, \(\text{Female}\) is no longer needed.
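As a quick check, the same t-statistic and p-value can be read directly from the coefficient matrix of the summary object (a sketch using base R):

#Extract the Female row: estimate, SE, t value, p-value
summary(model)$coefficients["Female", ]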

b)

To test the hypothesis that the variables \(\text{Female}\) and \(\text{HS}\) are not needed in the above regression equation, we set up the hypotheses:

\[H_0: \beta_{Female}=\beta_{HS}=0~~~; H_a: \text{At least one is different from 0}\]

#Reduced model without Female and HS variable
model2 <- lm(Sales~.-Female-HS,data= p088_data)

anova(model2,model)
## Analysis of Variance Table
## 
## Model 1: Sales ~ (Age + HS + Income + Black + Female + Price) - Female - 
##     HS
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     46 34960                           
## 2     44 34926  2    33.799 0.0213 0.9789

To test this hypothesis, we first fit a reduced model without the variables \(\text{Female}\) and \(\text{HS}\). Comparing it with the full model via the \(\text{anova()}\) function, we obtain \(\text{F-statistic}=0.0213\) and \(\text{p-value}=0.9789\). Therefore, we do not have enough evidence to reject \(H_0\), which implies that \(\text{Female}\) and \(\text{HS}\) are not statistically significant in predicting \(\text{Sales}\) once the other variables are in the model.
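The partial F-statistic can also be reproduced by hand from the two residual sums of squares (a sketch using the rounded RSS values printed above, so it matches the table only approximately):

#Partial F = [(RSS_reduced - RSS_full)/df_diff] / [RSS_full/df_full]
rss_reduced <- 34960; rss_full <- 34926
f_stat <- ((rss_reduced - rss_full)/2)/(rss_full/44)
f_stat                                  #~0.021
pf(f_stat, 2, 44, lower.tail = FALSE)   #~0.98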

c) 95% confidence interval for the true regression coefficient of the variable Income.

We have \(\hat{\beta}_{Income}= 0.01895\) and \(\text{SE}(\hat{\beta}_{Income})= 0.01022\).
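The interval is constructed from the t distribution with the model's residual degrees of freedom (44 here):

\[\hat{\beta}_{Income} \pm t_{0.975,\,44}\, SE(\hat{\beta}_{Income})\]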

b_income <- 0.01895
se_income <- 0.01022  
#Construct the upper bound 
UB <- b_income + qt(0.025,model$df.residual,lower.tail = FALSE)*se_income
#Construct the lower bound
LB <- b_income - qt(0.025,model$df.residual,lower.tail = FALSE)*se_income

conf_int_income <- c(LB,UB)

cat("The 95% CI for the true regression coefficient of the variable Income is: ", conf_int_income)
## The 95% CI for the true regression coefficient of the variable Income is:  -0.001647057 0.03954706
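Equivalently, base R's \(\text{confint()}\) computes the interval from the unrounded estimates, so its endpoints may differ slightly from the hand computation above:

#Built-in confidence interval for the Income coefficient
confint(model, "Income", level = 0.95)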

Problem 2

data_2 <- read.table("CH06PR05.txt")
colnames(data_2)<-c("brand_liking",'moisture','sweetness')
head(data_2)
##   brand_liking moisture sweetness
## 1           64        4         2
## 2           73        4         4
## 3           61        4         2
## 4           76        4         4
## 5           72        6         2
## 6           80        6         4

a) Relationship between Y and X1, X2

pairs(data_2)

From the scatter plot matrix above, there appears to be a positive correlation between \(\text{moisture content}\) and \(\text{brand liking}\). The relationship between \(\text{sweetness}\) and \(\text{brand liking}\) is less clear and may be non-linear.
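As a quick numeric companion to the plot (a sketch using base R's \(\text{cor()}\)), the pairwise correlation matrix quantifies these visual impressions:

#Pairwise correlations between brand liking and the two predictors
cor(data_2)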

b)

#Fit the model
problem2_model <- lm(brand_liking ~.,data= data_2)

# Summary of the model
summary(problem2_model)
## 
## Call:
## lm(formula = brand_liking ~ ., data = data_2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## moisture      4.4250     0.3011  14.695 1.78e-09 ***
## sweetness     4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

From the summary table, we obtain the fitted model as follows:

\[\hat{Y}= 37.65 + 4.425\,X_{\text{moisture}}+ 4.375\,X_{\text{sweetness}}\]

We can also see that both \(\text{Moisture}\) and \(\text{Sweetness}\) variables are statistically significant based on \(\text{p-value}<0.05\).

anova(problem2_model)
## Analysis of Variance Table
## 
## Response: brand_liking
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## moisture   1 1566.45 1566.45 215.947 1.778e-09 ***
## sweetness  1  306.25  306.25  42.219 2.011e-05 ***
## Residuals 13   94.30    7.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can obtain from the ANOVA table that \(\text{SSE}=94.30\) with \(\text{df=13}\) and \(\text{MSE}=7.25\).

From the ANOVA table above, when only \(\text{Moisture}\) is in the model, we have \(SSR(\text{Moisture})=1566.45\), \(\text{F-value}=215.947\), and a significant \(\text{p-value}\), which shows that \(\text{Moisture}\) has a significant effect on the response variable. Next, when we add \(\text{Sweetness}\) to the model, we have \(SSR(\text{Sweetness}|\text{Moisture})=306.25\) and \(\text{p-value}<0.05\). Hence, \(\beta_2\) is significantly different from 0 given \(\beta_1\).

Therefore, for the hypothesis \(H_0: \beta_1=\beta_2=0\), we reject \(H_0\), which implies that at least one of the two coefficients is different from 0.
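As a hand-check (a sketch using the rounded sums of squares from the ANOVA table, so it matches the summary's F-statistic of 129.1 only approximately), the overall F-statistic is the combined regression mean square divided by the MSE:

#Overall F = [SSR(Moisture) + SSR(Sweetness|Moisture)]/2 / MSE
((1566.45 + 306.25)/2)/7.25   #~129.2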

From the linear regression output table, we have

\(\hat{\beta}_1=4.425\). This shows that for every unit increase in moisture content, holding sweetness fixed, the mean degree of brand liking increases by 4.425. We have \(\text{p-value} < 0.05\); therefore, we have enough evidence to reject \(H_0: \beta_1=0\).

\(\hat{\beta}_2=4.375\). This shows that for every unit increase in sweetness, holding moisture fixed, the mean degree of brand liking increases by 4.375. We also have \(\text{p-value} < 0.05\); therefore, we reject the hypothesis \(H_0: \beta_2=0\).

Based on the ANOVA table above, both sequential tests are significant, so neither variable should be dropped: \(\text{Sweetness}\) still contributes significantly after \(\text{Moisture}\) is in the model. The same check can be run in the reverse order, entering \(\text{Sweetness}\) first and \(\text{Moisture}\) second, as we do in part g).

c)

par(mfrow=c(2,2))
#Plot ei and fitted Yi
residual_ei <- resid(problem2_model)
plot(fitted(problem2_model),residual_ei,xlab="Fitted Yi",ylab="Residual",main='residual plots of ei and fitted Yi')
#Plot ei and X1

plot(data_2$moisture,residual_ei,xlab="X1",ylab="Residual",main='residual plots of ei and X1')
#Plot ei and X2

plot(data_2$sweetness,residual_ei,xlab="X2",ylab="Residual",main='residual plots of ei and X2')

par(mfrow=c(1,2))
#Histogram plot of the Residuals
hist(residual_ei)

#Normality Probability plot of Residuals
qqnorm(residual_ei)
qqline(residual_ei)

From these plots, we conclude that the residuals appear approximately normally distributed, and the residual plots show no obvious pattern against the fitted values or either predictor.
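For a formal complement to the visual check, we can run a Shapiro-Wilk test on the residuals (a sketch using base R's \(\text{shapiro.test()}\)):

#Formal normality test; a large p-value is consistent with normal residuals
shapiro.test(residual_ei)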

d)

From the output, we have \(R^2= 0.9521\) and \(\text{Adjusted} ~R^2= 0.9447\).

e)

predict(problem2_model, newdata = list(moisture=5,sweetness=4), interval = 'confidence',level=0.95)
##      fit      lwr      upr
## 1 77.275 74.84094 79.70906

Given new \(X_{moisture}=5\) and \(X_{sweetness}=4\), the 95% confidence interval for \(E(Y_{new})\) is (74.84094, 79.70906).

f)

predict(problem2_model, newdata = list(moisture=5,sweetness=4), interval = 'prediction',level=0.95)
##      fit      lwr      upr
## 1 77.275 70.96788 83.58212

Given new \(X_{moisture}=5\) and \(X_{sweetness}=4\), the 95% prediction interval for a new observation \(Y_{new}\) is (70.96788, 83.58212). It is wider than the confidence interval in part e) because it also accounts for the variability of a single new observation.

g)

To obtain \(SSR(X_1|X_2)\), we can either compute \(SSE(X_2)-SSE(X_1,X_2)\) directly (sketched below) or read it from the ANOVA table. Since the Type I ANOVA table is sequential, we can reorder the two variables to obtain \(SSR(X_{moisture}|X_{sweetness})=1566.45\). This extra sum of squares measures the marginal effect of adding \(X_1\) to the regression model when \(X_2\) is already in the model.
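The subtraction route can be sketched with \(\text{deviance()}\), which returns the residual sum of squares of an \(\text{lm}\) fit:

#SSR(X1|X2) = SSE(X2) - SSE(X1, X2)
sse_x2   <- deviance(lm(brand_liking ~ sweetness, data = data_2))  #SSE(X2)
sse_full <- deviance(problem2_model)                               #SSE(X1, X2)
sse_x2 - sse_full   #1566.45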

problem2_model2 <- lm(brand_liking~sweetness + moisture,data=data_2)
anova(problem2_model2)
## Analysis of Variance Table
## 
## Response: brand_liking
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## sweetness  1  306.25  306.25  42.219 2.011e-05 ***
## moisture   1 1566.45 1566.45 215.947 1.778e-09 ***
## Residuals 13   94.30    7.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

h)

anova(problem2_model)
## Analysis of Variance Table
## 
## Response: brand_liking
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## moisture   1 1566.45 1566.45 215.947 1.778e-09 ***
## sweetness  1  306.25  306.25  42.219 2.011e-05 ***
## Residuals 13   94.30    7.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table above, we have \(SSR(X_2|X_1)=306.25\) and \(SSR(X_1)=1566.45\), so \(SSR(X_1,X_2)=SSR(X_1)+SSR(X_2|X_1)=1872.70\).

With \(SSE=94.30\), the total sum of squares is \(SSTO = SSR(X_1,X_2)+SSE = 1967.00\), and indeed \(R^2 = SSR(X_1,X_2)/SSTO = 0.9521\), matching the summary output.
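A sketch verifying \(SSTO\) directly from the data:

#Total sum of squares of the response
ssto <- sum((data_2$brand_liking - mean(data_2$brand_liking))^2)
ssto   #should equal SSR(X1, X2) + SSE = 1967.00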

i)

\[R^2_{Y1|2}=\frac{SSR(X_1|X_2)}{SSE(X_2)}\]

sweetness_model <- lm(brand_liking~sweetness,data=data_2)
anova(sweetness_model)
## Analysis of Variance Table
## 
## Response: brand_liking
##           Df  Sum Sq Mean Sq F value Pr(>F)
## sweetness  1  306.25  306.25  2.5817 0.1304
## Residuals 14 1660.75  118.62

We already have \(SSR(X_1|X_2)=1566.45\) and \(SSE(X_2)=1660.75\). Then \(R^2_{Y1|2}=1566.45/1660.75=0.9432\) measures the proportionate reduction in the variation in \(Y\) remaining after \(X_2\) is included in the model that is gained by also including \(X_1\) in the model.
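The same quantity can be sketched in code from the two fits already created:

#Partial R^2 of moisture given sweetness: SSR(X1|X2) / SSE(X2)
(deviance(sweetness_model) - deviance(problem2_model)) / deviance(sweetness_model)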

Problem 3

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

a)

problem3_model <- lm(Sepal.Length ~.-Species,data=iris)
summary(problem3_model)
## 
## Call:
## lm(formula = Sepal.Length ~ . - Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82816 -0.21989  0.01875  0.19709  0.84570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
## Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
## Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
## Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared:  0.8586, Adjusted R-squared:  0.8557 
## F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16

We have \(\hat{Y}_i=1.85600 + 0.65084\,x_1+0.70913\,x_2-0.55648\,x_3\), where \(x_1\), \(x_2\), and \(x_3\) are \(\text{Sepal.Width}\), \(\text{Petal.Length}\), and \(\text{Petal.Width}\), respectively.

b) Test the full model against the reduced (intercept-only) model

\[H_0: \beta_1=\beta_2=\beta_3=0~~~; H_A: \text{At least one is different from 0}\]

reduced_model3 <- lm(Sepal.Length ~1,data=iris)
anova(reduced_model3,problem3_model)
## Analysis of Variance Table
## 
## Model 1: Sepal.Length ~ 1
## Model 2: Sepal.Length ~ (Sepal.Width + Petal.Length + Petal.Width + Species) - 
##     Species
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    149 102.168                                  
## 2    146  14.445  3    87.723 295.54 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we have \(\text{F-statistic}=295.54\) and \(\text{p-value} < 0.05\). Therefore, we have enough evidence to reject \(H_0\) and conclude that at least one predictor has a significant effect on \(\text{sepal length}\).

c)

\[H_0: \beta_1-\beta_2=0; ~~~; H_A: \beta_1-\beta_2 \ne 0\]

We need the distribution of \(\hat{\beta}_1-\hat{\beta}_2\). We have \(E(\hat{\beta}_1-\hat{\beta}_2)=\beta_1-\beta_2\) and \(Var(\hat{\beta}_1-\hat{\beta}_2)=Var(\hat{\beta}_1)+Var(\hat{\beta}_2)-2\,Cov(\hat{\beta}_1,\hat{\beta}_2)\), where the variances and covariance come from \(MSE\,(X^TX)^{-1}\).

#Residual standard error = 0.3145 (rounded from the summary output;
#sigma(problem3_model) would give the unrounded value)
mse <- 0.3145^2


#Design matrix with an intercept column
X_matrix <- as.matrix(cbind(1, iris[,-c(1,5)]))

#Estimated covariance matrix of the coefficients: MSE * (X'X)^{-1}
cov_beta <- mse * solve(t(X_matrix) %*% X_matrix)

#Point estimate of beta1 - beta2 (Sepal.Width minus Petal.Length)
expectation <- problem3_model$coefficients[2] - problem3_model$coefficients[3]

#Var(b1 - b2) = Var(b1) + Var(b2) - 2Cov(b1, b2)
estimated_variance <- cov_beta[2,2] + cov_beta[3,3] - 2*cov_beta[2,3]

t_test <- expectation/sqrt(estimated_variance)

#Two-sided p-value, written with -abs() so it is correct whatever the sign of t
p_val <- 2*pt(-abs(t_test), 146)

cat("p-value for the hypothesis is:", p_val)
## p-value for the hypothesis is: 0.4277371

Since the \(\text{p-value} > 0.05\), we fail to reject \(H_0\) and conclude that there is no significant difference between \(\beta_1\) and \(\beta_2\).
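As a cross-check, base R's \(\text{vcov()}\) returns the coefficient covariance matrix directly (a sketch; it uses the exact MSE rather than the rounded 0.3145, so the p-value may differ slightly from ours):

#Built-in covariance matrix of the coefficient estimates
cb <- vcov(problem3_model)
t_check <- (coef(problem3_model)["Sepal.Width"] - coef(problem3_model)["Petal.Length"])/
  sqrt(cb["Sepal.Width","Sepal.Width"] + cb["Petal.Length","Petal.Length"] -
         2*cb["Sepal.Width","Petal.Length"])
2*pt(-abs(t_check), df.residual(problem3_model))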

d)

\[H_0: \beta_1-\beta_2=0; ~~~; H_A: \beta_1-\beta_2 < 0\]

p_val <- pt(t_test,146,lower.tail =TRUE)

cat("p-value for the hypothesis is:", p_val)
## p-value for the hypothesis is: 0.2138686

Since the \(\text{p-value} > 0.05\), we fail to reject \(H_0\); there is no evidence that \(\beta_1 < \beta_2\).