#1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
Intercept: The null hypothesis for the intercept is that expected sales are zero when TV, radio, and newspaper advertising budgets are all zero.
TV: The null hypothesis is that there is no relationship between TV advertising expenditure and sales. In other words, changes in TV advertising have no effect on sales.
Radio: The null hypothesis is that there is no relationship between radio advertising expenditure and sales. In other words, changes in radio advertising have no effect on sales.
Newspaper: The null hypothesis is that there is no relationship between newspaper advertising expenditure and sales. In other words, changes in newspaper advertising have no effect on sales.
Intercept: The p-value is less than 0.0001, which is typically considered statistically significant. This means we can reject the null hypothesis for the intercept, though this is rarely of substantive interest in itself.
TV: The p-value is less than 0.0001, which means that it is very unlikely that we would observe such a relationship between TV advertising and sales due to random chance alone. We can reject the null hypothesis and conclude that there is a statistically significant relationship between TV advertising and sales.
Radio: Similarly, the p-value for radio is less than 0.0001, suggesting a statistically significant relationship between radio advertising and sales. We can reject the null hypothesis for radio.
Newspaper: The p-value for newspaper is 0.8599, which is not statistically significant at common significance levels (like 0.05). Therefore, we fail to reject the null hypothesis for newspaper, suggesting that there’s no evidence of a relationship between newspaper advertising and sales.
In conclusion, based on the p-values provided: TV advertising is significantly associated with sales. Radio advertising is significantly associated with sales. Newspaper advertising is not significantly associated with sales.
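For reference, a minimal sketch of the regression behind Table 3.4, assuming the Advertising.csv file from the book's website is in the working directory:
# The Pr(>|t|) column of this summary contains the p-values discussed above
Advertising <- read.csv("Advertising.csv")
summary(lm(sales ~ TV + radio + newspaper, data = Advertising))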
#3. Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get ˆβ0 = 50, ˆβ1 = 20, ˆβ2 = 0.07, ˆβ3 = 35, ˆβ4 = 0.01, ˆβ5 = −10, so that Y = ˆβ0 + ˆβ1·GPA + ˆβ2·IQ + ˆβ3·Gender + ˆβ4·(GPA×IQ) + ˆβ5·(GPA×Gender). (a) Which answer is correct, and why? i. For a fixed value of IQ and GPA, males earn more on average than females. ii. For a fixed value of IQ and GPA, females earn more on average than males. iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough. iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.
#iii) is correct. For males (Gender = 0), the model is Y = 50 + 20·GPA + 0.07·IQ + 0.01·(GPA×IQ). For females (Gender = 1), it is Y = 50 + 35 + 20·GPA + 0.07·IQ + 0.01·(GPA×IQ) − 10·GPA = 85 + 10·GPA + 0.07·IQ + 0.01·(GPA×IQ). The female minus male difference is 35 − 10·GPA, which is negative whenever GPA > 3.5. So for a fixed IQ and GPA, males earn more on average than females provided the GPA is high enough.
#(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0. Y = 85 + 10·4 + 0.07·110 + 0.01·4·110 = 85 + 40 + 7.7 + 4.4 = 137.1, i.e. a predicted starting salary of $137,100.
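This can be checked with a short sketch in R (the salary() helper below is ours, not part of the exercise):
# Predicted salary (in $1000s) under the fitted model
salary <- function(gpa, iq, gender) {
  50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender
}
salary(4, 110, 1)                       # 137.1: the prediction in part (b)
salary(4, 110, 1) - salary(4, 110, 0)   # -5: at GPA = 4 > 3.5, males earn more, confirming (a)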
#(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer. False. The magnitude of a coefficient depends on the scale of the variables and says nothing by itself about the strength of the evidence; the p-value (equivalently, the t-statistic) of the interaction term is what determines whether the interaction is statistically significant.
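A small sketch with made-up data (names and numbers are ours) illustrates the point: rescaling a predictor changes the coefficient's magnitude 100-fold while leaving its t-statistic, and hence the evidence, unchanged.
# Rescaling IQ inflates the interaction coefficient but leaves the t-statistic intact
set.seed(3)  # arbitrary seed for this illustration
gpa <- runif(200, 2, 4); iq <- rnorm(200, 100, 15)
sal <- 50 + 20*gpa + 0.07*iq + 0.01*gpa*iq + rnorm(200)
fit1 <- lm(sal ~ gpa * iq)
fit2 <- lm(sal ~ gpa * I(iq/100))
coef(summary(fit1))["gpa:iq", ]          # small coefficient
coef(summary(fit2))["gpa:I(iq/100)", ]   # 100x larger coefficient, identical t value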
#6. Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (¯x, ¯y). From (3.4), the least squares intercept is ˆβ0 = ¯y − ˆβ1·¯x. The least squares line is Predicted Y = ˆβ0 + ˆβ1·x; evaluating it at x = ¯x gives Predicted Y = (¯y − ˆβ1·¯x) + ˆβ1·¯x = ¯y. So the least squares line always passes through the point (¯x, ¯y).
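A quick numerical sanity check of this fact, using arbitrary simulated data:
# The fitted line evaluated at mean(x) equals mean(y), up to floating-point error
set.seed(2)  # arbitrary seed for this check
xs <- rnorm(50); ys <- 1 + 3*xs + rnorm(50)
fit0 <- lm(ys ~ xs)
predict(fit0, data.frame(xs = mean(xs))) - mean(ys)  # essentially zero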
#8. This question involves the use of simple linear regression on the Auto data set. (a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example: i. Is there a relationship between the predictor and the response? ii. How strong is the relationship between the predictor and the response? iii. Is the relationship between the predictor and the response positive or negative? iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
setwd("/Users/tianchenxu/Desktop")
data <- read.csv("Auto.csv", header = TRUE, na.strings = "?")  # "?" marks missing values
Auto <- na.omit(data)  # drop the rows with missing values
summary(model<-lm(mpg~horsepower, data=Auto))
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i. Yes: the p-value for horsepower is < 2e-16, so there is strong evidence of a relationship between horsepower and mpg. ii. The relationship is moderately strong: R-squared = 0.6059, so horsepower explains about 61% of the variance in mpg. iii. Negative: the coefficient is -0.158, so mpg decreases as horsepower increases. iv. The predicted mpg at horsepower = 98 is 24.47; the 95% confidence and prediction intervals are computed below.
predict(model, data.frame(horsepower=c(98)), interval="confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(model, data.frame(horsepower=c(98)), interval="prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
The 95% prediction interval (14.81, 34.12) is much wider than the 95% confidence interval (23.97, 24.96) because it must cover the variability of an individual car's mpg, not just the uncertainty in the mean response.
#(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(Auto$horsepower, Auto$mpg)
abline(model)
#(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow=c(2,2))
plot(model)
Residuals vs Fitted: indicates potential non-linearity. Normal Q-Q: minor deviations suggest slight non-normality of the residuals. Scale-Location: a funnel shape suggests non-constant variance (heteroscedasticity). Residuals vs Leverage: a few points suggest potential undue influence on the model.
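The apparent non-linearity can be probed with a quadratic term; a quick sketch beyond what the exercise asks:
# Compare the linear fit with a quadratic fit; a small p-value supports the curvature seen above
quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
anova(model, quad)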
#11. In this problem we will investigate the t-statistic for the null hypothesis H0: β = 0 in simple linear regression without an intercept. (a) Perform a simple linear regression of y onto x without an intercept, and report the coefficient estimate, its standard error, and the t-statistic and p-value associated with the null hypothesis H0: β = 0. Comment on these results.
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
summary(lm(y~x+0))
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate is 1.9939, the standard error is 0.1065, and the t-statistic is 18.73. The p-value is < 2.2e-16, far below 0.05 and essentially 0, so we reject the null hypothesis H0: β = 0: there is a significant relationship between x and y.
#(b) Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-value associated with the null hypothesis H0: β = 0. Comment on these results.
summary(lm(x~y+0))
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate is 0.39111, the standard error is 0.02089, the t-statistic is 18.73, and the p-value is < 2.2e-16. The p-value is essentially 0, so we reject the null hypothesis H0: β = 0. Note that 0.39111/0.02089 = 18.72, the same t-statistic as in part (a).
#(c) What is the relationship between the results obtained in (a) and (b)? The t-statistics in (a) and (b) are identical: both regressions test the same linear association between x and y, only with the roles of predictor and response exchanged.
#(d) For the regression of Y onto X without an intercept, the t-statistic for H0: β = 0 takes the form ˆβ/SE(ˆβ), where ˆβ is given by (3.38). Show algebraically, and confirm numerically in R, that the t-statistic can be written as t = sqrt(n−1) · Σ(xi·yi) / sqrt(Σxi² · Σyi² − (Σ(xi·yi))²).
# Numerical confirmation: this reproduces the t-statistic reported by summary()
(sqrt(length(x)-1) * sum(x*y)) / (sqrt(sum(x*x) * sum(y*y) - (sum(x*y))^2))
## [1] 18.72593
#(e) Using the results from (d), argue that the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y. The expression in (d) is completely symmetric in x and y: swapping the two variables leaves every term unchanged, so the t-statistic must be the same in both regressions. Numerically, 1.9939/0.1065 and 0.39111/0.02089 both give 18.72, matching the reported t-statistic of 18.73 up to rounding. The strength of the linear relationship between the two variables is the same regardless of which one is treated as the response.
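Since the formula is symmetric, evaluating it with the roles of x and y exchanged reproduces the same number:
# Swapping x and y in the formula from (d) leaves every term unchanged
(sqrt(length(y)-1) * sum(y*x)) / (sqrt(sum(y*y) * sum(x*x) - (sum(y*x))^2))  # again 18.72593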
#f)In R, show that when regression is performed with an intercept, the t-statistic for H0: β1 = 0 is the same for the regression of y onto x as it is for the regression of x onto y.
# Linear regression of y onto x
model1 <- lm(y ~ x)
summary1 <- summary(model1)
t_stat_1 <- coef(summary1)["x", "t value"]
t_stat_1
## [1] 18.5556
# Linear regression of x onto y
model2 <- lm(x ~ y)
summary2 <- summary(model2)
t_stat_2 <- coef(summary2)["y", "t value"]
t_stat_2
## [1] 18.5556
#14. This problem focuses on the collinearity problem. (a) Perform the following commands in R; the last line creates a linear model in which y is a function of x1 and x2. (b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
set.seed(1)
x1 = runif(100)
x2 = 0.5*x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)
#The correlation between x1 and x2 is 0.8351, so the two predictors are highly collinear.
#(c) Fit a least squares regression to predict y using x1 and x2. What are ˆβ0, ˆβ1, and ˆβ2? Can you reject the null hypotheses H0: β1 = 0 and H0: β2 = 0?
summary(lm(y~x1+x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The fitted model is Y = 2.1305 + 1.4396·x1 + 1.0097·x2, i.e. ˆβ0 = 2.1305, ˆβ1 = 1.4396, and ˆβ2 = 1.0097. ˆβ0 is close to the true β0 = 2, but ˆβ1 underestimates the true β1 = 2 and ˆβ2 overestimates the true β2 = 0.3, a symptom of the collinearity. For the null hypothesis H0: β1 = 0, we reject it, though only barely, since p = 0.0487 < 0.05. For the null hypothesis H0: β2 = 0, we fail to reject, since p = 0.3754.
#(d) Now fit a least squares regression to predict y using only x1. Can you reject the null hypothesis H0: β1 = 0?
summary(lm(y~x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The p-value for x1 is 2.66e-06, far below 0.05, so we reject H0: β1 = 0.
#(e) Now fit a least squares regression to predict y using only x2. Can you reject the null hypothesis H0: β1 = 0?
summary(lm(y~x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
The p-value for x2 is 1.37e-05, essentially 0, so we also reject H0: β1 = 0.
#(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer. No. Because x1 and x2 are highly correlated, the model in (c) cannot separate their individual effects: the collinearity inflates the standard errors of the coefficients, so x2 appears insignificant even though it is related to y. When each predictor is fit on its own, as in (d) and (e), its relationship with y is clear. The variance inflation factor sketched below makes this concrete.
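A minimal sketch computing the variance inflation factor by hand, with no extra packages assumed:
# VIF = 1/(1 - R^2) from regressing one predictor on the other
r2 <- summary(lm(x1 ~ x2))$r.squared
1/(1 - r2)  # about 3.3 here (cor = 0.8351), confirming substantial collinearity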
#(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1=c(x1, 0.1)
x2=c(x2, 0.8)
y=c(y,6)
#Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
summary(lm(y~x1+x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
model1<-lm(y~x1+x2)
summary(lm(y~x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
model2<-lm(y~x1)
summary(lm(y~x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
model3<-lm(y~x2)
par(mfrow=c(2,2))
plot(model1)
plot(model2)
plot(model3)
Based on the diagnostic plots of model1 (y ~ x1 + x2), observation 101 (the new point) is a high-leverage point: its (x1, x2) pair of (0.1, 0.8) breaks the collinearity pattern of the other observations, and its influence flips which predictor appears significant (x2 is now significant while x1 no longer is). Its residual is not extreme, so it is not a clear outlier.
Based on the diagnostic plots of model2 (y ~ x1), observation 101 is an outlier but not a high-leverage point: it has the largest residual (3.57), while x1 = 0.1 lies well within the range of the other x1 values, so its leverage is modest.
Based on the diagnostic plots of model3 (y ~ x2), observation 101 is a high-leverage point but not an outlier: x2 = 0.8 lies beyond the rest of the x2 values, but its residual is unremarkable.
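As a numerical cross-check, a sketch using base R's influence measures:
# Leverage and studentized residual of the new observation (index 101) in each model
idx <- length(y)  # 101
for (m in list(model1, model2, model3)) {
  cat("leverage:", round(hatvalues(m)[idx], 3),
      " studentized residual:", round(rstudent(m)[idx], 3), "\n")
}
# Expect high leverage in model1 and model3; a large residual (an outlier) only in model2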