#Read data
p088_data <- read.table('P088.txt',header = TRUE)
p088_data <- p088_data[,-c(1)]
head(p088_data)
## Age HS Income Black Female Price Sales
## 1 27.0 41.3 2948 26.2 51.7 42.7 89.8
## 2 22.9 66.7 4644 3.0 45.7 41.8 121.3
## 3 26.3 58.1 3665 3.0 50.8 38.5 115.2
## 4 29.1 39.9 2878 18.3 51.5 38.8 100.3
## 5 28.1 62.6 4493 7.0 50.8 39.7 123.0
## 6 26.2 63.9 3855 3.0 50.7 31.1 124.8
To test whether the variable \(\text{Female}\) is needed in the regression equation relating \(\text{Sales}\) to the six predictor variables, we set up the hypotheses as follows:
\[H_0: \beta_{Female}=0 \quad \text{vs.} \quad H_a: \beta_{Female}\ne 0\]
#Build the linear regression model
model<- lm(Sales~., data= p088_data)
summary(model)
##
## Call:
## lm(formula = Sales ~ ., data = p088_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.398 -12.388 -5.367 6.270 133.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.34485 245.60719 0.421 0.67597
## Age 4.52045 3.21977 1.404 0.16735
## HS -0.06159 0.81468 -0.076 0.94008
## Income 0.01895 0.01022 1.855 0.07036 .
## Black 0.35754 0.48722 0.734 0.46695
## Female -1.05286 5.56101 -0.189 0.85071
## Price -3.25492 1.03141 -3.156 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared: 0.3208, Adjusted R-squared: 0.2282
## F-statistic: 3.464 on 6 and 44 DF, p-value: 0.006857
From the output, the \(t\)-statistic is \(t=\hat{\beta}_{Female}/SE(\hat{\beta}_{Female})=-0.189\) with \(\text{p-value}=0.85071>0.05\). Therefore, we fail to reject \(H_0\) and conclude that, once the other variables are in the model, the \(\text{Female}\) variable is not needed.
To test the hypothesis that the variables \(\text{Female}\) and \(\text{HS}\) are both unnecessary in the above regression equation, we test:
\[H_0: \beta_{Female}=\beta_{HS}=0 \quad \text{vs.} \quad H_a: \text{at least one is different from 0}\]
#Reduced model without Female and HS variable
model2 <- lm(Sales~.-Female-HS,data= p088_data)
anova(model2,model)
## Analysis of Variance Table
##
## Model 1: Sales ~ (Age + HS + Income + Black + Female + Price) - Female -
## HS
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 34960
## 2 44 34926 2 33.799 0.0213 0.9789
To test this hypothesis, we first fit a reduced model without the variables \(\text{Female}\) and \(\text{HS}\). The \(\text{anova()}\) comparison gives \(\text{F-statistic}=0.0213\) with \(\text{p-value}=0.9789\). Therefore, we do not have enough evidence to reject \(H_0\): the variables \(\text{Female}\) and \(\text{HS}\) do not add significant explanatory power for \(\text{Sales}\) once the other predictors are in the model.
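As a sanity check, the same partial F-statistic can be reproduced by hand from the two residual sums of squares; a minimal sketch using \(\text{deviance()}\), which returns the RSS of an \(\text{lm}\) fit:
#Reproduce the partial F-test from the residual sums of squares
rss_reduced <- deviance(model2) #SSE of the reduced model
rss_full <- deviance(model) #SSE of the full model
df_extra <- model2$df.residual - model$df.residual #number of coefficients tested (2)
F_stat <- ((rss_reduced - rss_full)/df_extra)/(rss_full/model$df.residual)
c(F = F_stat, p = pf(F_stat, df_extra, model$df.residual, lower.tail = FALSE)) #should match anova(): F = 0.0213, p = 0.9789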
From the summary output, \(\hat{\beta}_{Income}= 0.01895\) and \(\text{SE}(\hat{\beta}_{Income}) = 0.01022\), so a 95% confidence interval is \(\hat{\beta}_{Income} \pm t_{0.975,\,44}\,\text{SE}(\hat{\beta}_{Income})\):
b_income <- 0.01895
se_income <- 0.01022
#Construct the upper bound
UB <- b_income + qt(0.025,model$df.residual,lower.tail = FALSE)*se_income
#Construct the lower bound
LB <- b_income - qt(0.025,model$df.residual,lower.tail = FALSE)*se_income
conf_int_income <- c(LB,UB)
cat("The 95% CI for the true regression coefficient of the variable Income is: ", conf_int_income)
## The 95% CI for the true regression coefficient of the variable Income is: -0.001647057 0.03954706
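Up to rounding of the hand-entered estimate and standard error, this matches the interval from the built-in \(\text{confint()}\) function:
#Cross-check with the built-in confidence interval function
confint(model, "Income", level = 0.95)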
data_2 <- read.table("CH06PR05.txt")
colnames(data_2)<-c("brand_liking",'moisture','sweetness')
head(data_2)
## brand_liking moisture sweetness
## 1 64 4 2
## 2 73 4 4
## 3 61 4 2
## 4 76 4 4
## 5 72 6 2
## 6 80 6 4
pairs(data_2)
From the scatterplot matrix above, there appears to be a positive linear relationship between \(\text{moisture content}\) and \(\text{brand liking}\). The relationship between \(\text{sweetness}\) and \(\text{brand liking}\) is weaker and may be non-linear.
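To put numbers on what the plot suggests, we can also inspect the pairwise correlations (a quick supplementary check):
#Pairwise correlations corresponding to the scatterplot matrix
round(cor(data_2), 3)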
#Fit the model
problem2_model <- lm(brand_liking ~.,data= data_2)
# Summary of the model
summary(problem2_model)
##
## Call:
## lm(formula = brand_liking ~ ., data = data_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.400 -1.762 0.025 1.587 4.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.6500 2.9961 12.566 1.20e-08 ***
## moisture 4.4250 0.3011 14.695 1.78e-09 ***
## sweetness 4.3750 0.6733 6.498 2.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9447
## F-statistic: 129.1 on 2 and 13 DF, p-value: 2.658e-09
From the summary table, we obtain the fitted model as follows:
\[\hat{Y}= 37.65 + 4.425\,X_{\text{moisture}} + 4.375\,X_{\text{sweetness}}\]
We can also see that both the \(\text{Moisture}\) and \(\text{Sweetness}\) variables are statistically significant, with \(\text{p-values}<0.05\).
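As a supplementary check, interval estimates for all coefficients are available from \(\text{confint()}\):
#95% confidence intervals for the regression coefficients
confint(problem2_model, level = 0.95)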
anova(problem2_model)
## Analysis of Variance Table
##
## Response: brand_liking
## Df Sum Sq Mean Sq F value Pr(>F)
## moisture 1 1566.45 1566.45 215.947 1.778e-09 ***
## sweetness 1 306.25 306.25 42.219 2.011e-05 ***
## Residuals 13 94.30 7.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table, we obtain \(\text{SSE}=94.30\) with \(\text{df}=13\) and \(\text{MSE}=7.25\).
From the ANOVA table above, when \(\text{Moisture}\) is the only variable in the model, we have \(SSR(\text{Moisture})=1566.45\), \(\text{F-value}=215.947\), and a significant \(\text{p-value}\), which shows that \(\text{Moisture}\) has a significant effect on the response variable. Next, when we add \(\text{Sweetness}\) to the model, we have \(SSR(\text{Sweetness}\mid\text{Moisture})=306.25\) and \(\text{p-value}<0.05\). Hence, \(\beta_2\) is significantly different from 0 given that \(X_1\) is already in the model.
Therefore, for the hypothesis \(H_0: \beta_1=\beta_2=0\), we reject \(H_0\), which implies that at least one of the two coefficients is nonzero.
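Indeed, the overall \(\text{F-statistic}\) reported by \(\text{summary()}\) can be recovered from the sums of squares in this table:
\[F = \frac{SSR(X_1,X_2)/2}{MSE} = \frac{(1566.45+306.25)/2}{7.25} \approx 129.1\]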
From the linear regression output table, we have:
\(\hat{\beta}_1=4.425\). For every unit increase in moisture content, the mean degree of brand liking increases by 4.425. Since the \(\text{p-value} < 0.05\), we have enough evidence to reject \(H_0: \beta_1=0\).
\(\hat{\beta}_2=4.375\). For every unit increase in sweetness, the mean degree of brand liking increases by 4.375. Since the \(\text{p-value} < 0.05\), we also reject \(H_0: \beta_2=0\).
The sequential ANOVA table above lets us test whether \(\text{Sweetness}\) can be dropped while keeping \(\text{Moisture}\). To test the reverse, dropping \(\text{Moisture}\) while keeping \(\text{Sweetness}\), we refit the model with the predictors entered in the opposite order (see further below).
par(mfrow=c(2,2))
#Plot ei and fitted Yi
residual_ei <- resid(problem2_model)
plot(fitted(problem2_model),residual_ei,xlab="Fitted Yi",ylab="Residual",main='residual plots of ei and fitted Yi')
#Plot ei and X1
plot(data_2$moisture,residual_ei,xlab="X1",ylab="Residual",main='residual plots of ei and X1')
#Plot ei and X2
plot(data_2$sweetness,residual_ei,xlab="X2",ylab="Residual",main='residual plots of ei and X2')
par(mfrow=c(1,2))
#Histogram plot of the Residuals
hist(residual_ei)
#Normality Probability plot of Residuals
qqnorm(residual_ei)
qqline(residual_ei)
From these plots, we conclude that the residuals are approximately normally distributed and show no systematic pattern against the fitted values or either predictor, so the model assumptions appear reasonable.
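For a formal check of normality, one option (supplementary to the plots) is a Shapiro-Wilk test on the residuals:
#Shapiro-Wilk normality test of the residuals
shapiro.test(residual_ei)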
From the output, we have \(R^2= 0.9521\) and \(\text{Adjusted} ~R^2= 0.9447\).
predict(problem2_model, newdata = list(moisture=5,sweetness=4), interval = 'confidence',level=0.95)
## fit lwr upr
## 1 77.275 74.84094 79.70906
Given \(X_{moisture}=5\) and \(X_{sweetness}=4\), the 95% confidence interval for \(E(Y_{new})\) is \((74.84094,\ 79.70906)\).
predict(problem2_model, newdata = list(moisture=5,sweetness=4), interval = 'prediction',level=0.95)
## fit lwr upr
## 1 77.275 70.96788 83.58212
Given \(X_{moisture}=5\) and \(X_{sweetness}=4\), the 95% prediction interval for a new observation \(Y_{new}\) is \((70.96788,\ 83.58212)\). As expected, it is wider than the confidence interval because it also accounts for the variability of a single new observation.
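The prediction interval can also be reconstructed by hand; a minimal sketch, assuming \(\text{predict()}\) with \(\text{se.fit = TRUE}\), where the prediction variance adds the MSE to the variance of the fitted value:
#Reconstruct the 95% prediction interval from the fitted-value SE and the MSE
pred <- predict(problem2_model, newdata = list(moisture = 5, sweetness = 4), se.fit = TRUE)
mse_hat <- sum(resid(problem2_model)^2)/problem2_model$df.residual
se_pred <- sqrt(pred$se.fit^2 + mse_hat) #prediction SE = sqrt(se.fit^2 + MSE)
pred$fit + c(-1, 1)*qt(0.975, problem2_model$df.residual)*se_pred #should match (70.96788, 83.58212)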
To obtain \(SSR(X_1 \mid X_2)\), we can either compute \(SSE(X_2)-SSE(X_1,X_2)\) or read it from an ANOVA table. Since the Type I ANOVA table is sequential, reordering the two variables gives \(SSR(X_{moisture} \mid X_{sweetness})=1566.45\). This extra sum of squares measures the marginal effect of adding \(X_1\) to the regression model when \(X_2\) is already in the model.
problem2_model2 <- lm(brand_liking~sweetness + moisture,data=data_2)
anova(problem2_model2)
## Analysis of Variance Table
##
## Response: brand_liking
## Df Sum Sq Mean Sq F value Pr(>F)
## sweetness 1 306.25 306.25 42.219 2.011e-05 ***
## moisture 1 1566.45 1566.45 215.947 1.778e-09 ***
## Residuals 13 94.30 7.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(problem2_model)
## Analysis of Variance Table
##
## Response: brand_liking
## Df Sum Sq Mean Sq F value Pr(>F)
## moisture 1 1566.45 1566.45 215.947 1.778e-09 ***
## sweetness 1 306.25 306.25 42.219 2.011e-05 ***
## Residuals 13 94.30 7.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA tables above, we have \(SSR(X_2 \mid X_1)=306.25\) and \(SSR(X_1)=1566.45\). Hence \(SSR(X_1,X_2)= SSR(X_1) + SSR(X_2 \mid X_1)=1872.70\), and \(SSTO = SSR(X_1,X_2) + SSE = 1872.70 + 94.30 = 1967.00\); equivalently, \(SSTO = \frac{SSE}{1-R^2}\approx 1967\), up to rounding in \(R^2\).
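A quick numeric check of this decomposition:
#Check SSTO = SSR(X1) + SSR(X2|X1) + SSE
ssto <- sum((data_2$brand_liking - mean(data_2$brand_liking))^2)
c(SSTO = ssto, decomposition = 1566.45 + 306.25 + 94.30)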
The coefficient of partial determination is defined as
\[R^2_{Y1|2}=\frac{SSR(X_1\mid X_2)}{SSE(X_2)}\]
sweetness_model <- lm(brand_liking~sweetness,data=data_2)
anova(sweetness_model)
## Analysis of Variance Table
##
## Response: brand_liking
## Df Sum Sq Mean Sq F value Pr(>F)
## sweetness 1 306.25 306.25 2.5817 0.1304
## Residuals 14 1660.75 118.62
We already had \(SSR(X_1\mid X_2)=1566.45\) and \(SSE(X_2)=1660.75\). Then \(R^2_{Y1|2}=1566.45/1660.75=0.9432\), which measures the proportionate reduction in the variation in \(Y\) remaining after \(X_2\) is included in the model that is gained by also including \(X_1\) in the model.
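The same quantities can be recovered directly from the two fitted models:
#Verify SSR(X1|X2) = SSE(X2) - SSE(X1,X2) and the partial R^2
ssr_x1_given_x2 <- deviance(sweetness_model) - deviance(problem2_model)
ssr_x1_given_x2/deviance(sweetness_model) #R^2_{Y1|2} = 0.9432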
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
problem3_model <- lm(Sepal.Length ~.-Species,data=iris)
summary(problem3_model)
##
## Call:
## lm(formula = Sepal.Length ~ . - Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82816 -0.21989 0.01875 0.19709 0.84570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.85600 0.25078 7.401 9.85e-12 ***
## Sepal.Width 0.65084 0.06665 9.765 < 2e-16 ***
## Petal.Length 0.70913 0.05672 12.502 < 2e-16 ***
## Petal.Width -0.55648 0.12755 -4.363 2.41e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared: 0.8586, Adjusted R-squared: 0.8557
## F-statistic: 295.5 on 3 and 146 DF, p-value: < 2.2e-16
We have \(\hat{Y}_i=1.85600 + 0.65084\,x_1+0.70913\,x_2-0.55648\,x_3\), where \(x_1=\)Sepal.Width, \(x_2=\)Petal.Length, and \(x_3=\)Petal.Width.
\[H_0: \beta_1=\beta_2=\beta_3=0 \quad \text{vs.} \quad H_A: \text{at least one is different from 0}\]
reduced_model3 <- lm(Sepal.Length ~1,data=iris)
anova(reduced_model3,problem3_model)
## Analysis of Variance Table
##
## Model 1: Sepal.Length ~ 1
## Model 2: Sepal.Length ~ (Sepal.Width + Petal.Length + Petal.Width + Species) -
## Species
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 149 102.168
## 2 146 14.445 3 87.723 295.54 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table, we have \(\text{F-statistic}=295.54\) and \(\text{p-value} < 0.05\) (this is the same overall F-test reported by \(\text{summary()}\)). Therefore, we have enough evidence to reject \(H_0\): at least one of the predictors has a significant effect on \(\text{sepal length}\).
\[H_0: \beta_1-\beta_2=0 \quad \text{vs.} \quad H_A: \beta_1-\beta_2 \ne 0\]
We need the sampling distribution of \(\hat{\beta}_1-\hat{\beta}_2\). We have \(E(\hat{\beta}_1-\hat{\beta}_2)=\beta_1-\beta_2\) and \(Var(\hat{\beta}_1-\hat{\beta}_2)=Var(\hat{\beta}_1)+Var(\hat{\beta}_2)-2\,Cov(\hat{\beta}_1,\hat{\beta}_2)\).
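Under \(H_0\), the test statistic
\[t=\frac{\hat{\beta}_1-\hat{\beta}_2}{\sqrt{\widehat{Var}(\hat{\beta}_1)+\widehat{Var}(\hat{\beta}_2)-2\,\widehat{Cov}(\hat{\beta}_1,\hat{\beta}_2)}}\]
follows a \(t\)-distribution with 146 degrees of freedom. The code below estimates the covariance matrix of the coefficients as \(MSE\,(X^TX)^{-1}\), using the rounded residual standard error from the summary output.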
#Residual standard error from the summary output (rounded) = 0.3145
mse <- 0.3145^2
#Design matrix: intercept plus the three numeric predictors
X_matrix <- as.matrix(cbind(1, iris[, -c(1, 5)]))
#Estimated covariance matrix of the coefficients: MSE * (X'X)^{-1}
cov_beta <- mse * solve(t(X_matrix) %*% X_matrix)
#Point estimate of beta1 - beta2
expectation <- problem3_model$coefficients[2] - problem3_model$coefficients[3]
#Estimated variance of the difference
estimated_variance <- cov_beta[2, 2] + cov_beta[3, 3] - 2*cov_beta[2, 3]
t_test <- expectation/sqrt(estimated_variance)
#Two-sided p-value
p_val <- 2*pt(-abs(t_test), 146)
cat("p-value for the hypothesis is:", p_val)
## p-value for the hypothesis is: 0.4277371
Since the \(\text{p-value} > 0.05\), we fail to reject \(H_0\); there is no evidence that \(\beta_1\) and \(\beta_2\) differ.
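As a cross-check, R's built-in \(\text{vcov()}\) returns the same estimated covariance matrix but with the unrounded residual variance, so the result should agree up to the rounding of \(\text{mse}\) above:
#Cross-check using vcov(), which computes MSE*(X'X)^{-1} with the exact MSE
cov_r <- vcov(problem3_model)
b <- coef(problem3_model)
var_diff <- cov_r["Sepal.Width","Sepal.Width"] + cov_r["Petal.Length","Petal.Length"] - 2*cov_r["Sepal.Width","Petal.Length"]
2*pt(-abs((b["Sepal.Width"] - b["Petal.Length"])/sqrt(var_diff)), problem3_model$df.residual)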
\[H_0: \beta_1-\beta_2=0 \quad \text{vs.} \quad H_A: \beta_1-\beta_2 < 0\]
#One-sided p-value for the lower-tail alternative
p_val <- pt(t_test, 146, lower.tail = TRUE)
cat("p-value for the hypothesis is:", p_val)
## p-value for the hypothesis is: 0.2138686
Since the \(\text{p-value} > 0.05\), we fail to reject \(H_0\); there is no evidence that \(\beta_1 < \beta_2\).