This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

Holding the other predictors fixed, each one-dollar increase in Price is associated with a drop in Sales of about 0.054 thousand units (roughly 54 car seats, since Sales is measured in thousands). The Urban coefficient is tiny and not significant. Holding everything else fixed, a store located in the US sells about 1.2 thousand more car seats than one outside the US.

For which of the predictors can you reject the null hypothesis H0: βj = 0?

Price and US. Both have p-values far below 0.05, whereas Urban (p = 0.936) gives no evidence against the null.

library(ISLR)     # Carseats data set
library(knitr)    # kable(), used below
library(ggplot2)  # plots in the later question

model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
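
The p-values can also be pulled straight out of the summary object (just re-reading the table above programmatically):

summary(model)$coefficients[, "Pr(>|t|)"]  # Price and US tiny, Urban ~0.94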

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

model2 <- update(model, Sales ~ . - Urban)
summary(model2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

How well do the models in (a) and (e) fit the data?

In absolute terms neither model fits especially well: both explain only about 24% of the variance in Sales (R² ≈ 0.239). And given the ANOVA comparison below, there is essentially no difference between model and model2.

anova(model, model2)
## Analysis of Variance Table
## 
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    396 2420.8                           
## 2    397 2420.9 -1  -0.03979 0.0065 0.9357
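
Extracting the R² values directly makes the same point, namely that dropping Urban costs essentially nothing:

summary(model)$r.squared   # 0.2393
summary(model2)$r.squared  # 0.2393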

As the diagnostic plots show, there is not much difference between the two models here either.

par(mfrow = c(2,2))
plot(model)

par(mfrow = c(2,2))
plot(model2)

But following Occam's razor, I would choose the simpler model, model2 from (e). As for outliers and high-leverage observations in the model from (e): yes, the residuals-vs-leverage plot shows signs of at least one high-leverage point.
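
As a sketch of how to check that numerically (using the common rule of thumb that hat values above twice the average leverage deserve a look; the cutoff is my choice, not the book's):

hv <- hatvalues(model2)
cutoff <- 2 * length(coef(model2)) / nrow(Carseats)  # 2 * (p + 1) / n
which(hv > cutoff)  # indices of candidate high-leverage observations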

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

kable(confint(model2, level = 0.95))

|            |      2.5 %|     97.5 %|
|:-----------|----------:|----------:|
|(Intercept) | 11.7903202| 14.2712653|
|Price       | -0.0647598| -0.0441954|
|USYes       |  0.6915196|  1.7077663|
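
These intervals are just estimate ± t(0.975, 397 df) × SE; a quick hand check for the Price row:

est <- coef(summary(model2))["Price", "Estimate"]
se  <- coef(summary(model2))["Price", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = 397) * se  # reproduces the Price interval above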

The next question uses simulated data to investigate the t-statistic for the null hypothesis H0: β = 0 in simple linear regression without an intercept.

(a) Perform a simple linear regression of y onto x, without an intercept. Report the coefficient estimate β̂, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis H0: β = 0. Comment on these results. (You can perform regression without an intercept using the command lm(y ~ x + 0).)

As the summary shows, the coefficient estimate β̂ = 1.9939 (close to the true value of 2) is highly significant: standard error 0.1065, t-statistic 18.73, p-value < 2e-16.

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
xy <- data.frame(x, y)
modxy <- lm(y ~ x + 0, data = xy)
summary(modxy)
## 
## Call:
## lm(formula = y ~ x + 0, data = xy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
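
Without an intercept the least-squares estimate has the closed form β̂ = Σxᵢyᵢ / Σxᵢ² (ISLR eq. 3.38), which is easy to verify here:

sum(x * y) / sum(x^2)  # 1.993876, matching coef(modxy)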

(b) Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-value associated with the null hypothesis H0: β = 0. Comment on these results.

modyx <- lm(x ~ y + 0, data = xy)
summary(modyx)
## 
## Call:
## lm(formula = x ~ y + 0, data = xy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

What is the relationship between the results obtained in (a) and (b)?

The regression in (b) is the reverse of (a): both fits describe the same underlying linear relationship, so the t-statistics and R² are identical, but the slope estimates differ (1.994 vs 0.391) because each regression minimizes residuals in a different direction.

dt <- data.frame(do.call(rbind, list(coef(modxy), coef(modyx))))
rownames(dt) <- c("x", "y"); colnames(dt) <- "Coefficients"
kable(dt)
|   | Coefficients|
|:--|------------:|
|x  |    1.9938761|
|y  |    0.3911145|
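
The two slopes are also linked algebraically: their product is (Σxᵢyᵢ)² / (Σxᵢ² Σyᵢ²), which equals the R² the two fits share.

coef(modxy) * coef(modyx)  # ~0.7798
summary(modxy)$r.squared   # 0.7798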
g <- ggplot(xy, aes(x = x, y = y)) + geom_point() +
  geom_line(aes(y = predict(modxy))) +                # line from (a): y regressed on x
  geom_line(aes(x = predict(modyx)), colour = "red")  # line from (b): x regressed on y
g

# Coefficient tables, used below to check the closed-form SE and t-statistic.
a <- data.frame(summary(modxy)$coefficients)
b <- data.frame(summary(modyx)$coefficients)
kable(a)
|   | Estimate| Std..Error|  t.value| Pr...t..|
|:--|--------:|----------:|--------:|--------:|
|x  | 1.993876|  0.1064767| 18.72593|        0|

(d) Confirm numerically in R that the t-statistic for the regression of y onto x without an intercept can be written as

t = √(n−1) · Σxᵢyᵢ / √( Σxᵢ² · Σyᵢ² − (Σxᵢyᵢ)² )

beta <- coef(modxy)
beta
##        x 
## 1.993876
sum(x^2)
## [1] 81.05509
# SE(beta) = sqrt( sum((y - x*beta)^2) / ((n - 1) * sum(x^2)) )
den <- (length(y) - 1) * sum(x^2)
num <- sum((y - x * beta)^2)
sqrt(num/den) == a$Std..Error  # exact equality fails only due to floating-point error
## [1] FALSE
round(sqrt(num/den), 5) == round(a$Std..Error, 5)
## [1] TRUE
# Closed-form t-statistic: sqrt(n-1) * sum(x*y) / sqrt(sum(x^2)*sum(y^2) - sum(x*y)^2)
Tnum <- sqrt(length(y) - 1) * sum(x * y)
Tden <- sqrt(sum(x^2) * sum(y^2) - sum(x*y)^2)
round(a$t.value, 5) == round(Tnum/Tden, 5)
## [1] TRUE

(e) The formula in (d) is symmetric in x and y: swapping them leaves it unchanged. So the t-statistic for the regression of x onto y without an intercept must be identical, which the same computation confirms:

beta <- coef(modyx)
beta
##         y 
## 0.3911145
sum(y^2)
## [1] 413.2135
den <- (length(x) - 1) * sum(y^2)
num <- sum((x - y * beta)^2)
round(sqrt(num/den), 5) == round(b$Std..Error, 5)
## [1] TRUE
Tnum <- sqrt(length(x) - 1) * sum(x * y)
Tden <- sqrt(sum(x^2) * sum(y^2) - sum(x*y)^2)
round(b$t.value, 5) == round(Tnum/Tden, 5)
## [1] TRUE
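
Extracting the two no-intercept t-statistics directly confirms they are identical, as the symmetric formula requires:

summary(modxy)$coefficients["x", "t value"]  # 18.72593
summary(modyx)$coefficients["y", "t value"]  # 18.72593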

(f) Show that when regression is performed with an intercept, the t-statistic for H0: β1 = 0 is the same for the regression of y onto x as it is for the regression of x onto y.

They are the same: the slope t-statistic is 18.56 for the regression of y onto x, and the reverse fit (checked below) gives the identical value.

modxyI <- lm(y ~ x, data = xy)
summary(modxyI)$coefficients
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -0.03769261 0.09698729 -0.3886346 6.983896e-01
## x            1.99893961 0.10772703 18.5555993 7.723851e-34
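
Fitting the reverse regression with an intercept (modyxI is just a name introduced here) shows the matching slope t-statistic:

modyxI <- lm(x ~ y, data = xy)
summary(modyxI)$coefficients["y", "t value"]  # ~18.556, same as for x above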