#1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
Intercept: The null hypothesis for the intercept is that expected sales are zero when TV, radio, and newspaper advertising budgets are all zero.
TV: The null hypothesis is that there is no relationship between TV advertising expenditure and sales. In other words, changes in TV advertising have no effect on sales.
Radio: The null hypothesis is that there is no relationship between radio advertising expenditure and sales. In other words, changes in radio advertising have no effect on sales.
Newspaper: The null hypothesis is that there is no relationship between newspaper advertising expenditure and sales. In other words, changes in newspaper advertising have no effect on sales.
Intercept: The p-value is less than 0.0001, which is typically considered statistically significant. This means we can reject the null hypothesis for the intercept, though this is rarely of substantive interest in itself.
TV: The p-value is less than 0.0001, which means that it is very unlikely that we would observe such a relationship between TV advertising and sales due to random chance alone. We can reject the null hypothesis and conclude that there is a statistically significant relationship between TV advertising and sales.
Radio: Similarly, the p-value for radio is less than 0.0001, suggesting a statistically significant relationship between radio advertising and sales. We can reject the null hypothesis for radio.
Newspaper: The p-value for newspaper is 0.8599, which is not statistically significant at common significance levels (like 0.05). Therefore, we fail to reject the null hypothesis for newspaper, suggesting that there’s no evidence of a relationship between newspaper advertising and sales.
In conclusion, based on the p-values provided: TV advertising is significantly associated with sales. Radio advertising is significantly associated with sales. Newspaper advertising is not significantly associated with sales.
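For reference, a minimal sketch of the regression behind Table 3.4, assuming the Advertising.csv file from the book's website is in the working directory:
# The Pr(>|t|) column of this summary contains the p-values discussed above
Advertising <- read.csv("Advertising.csv")
summary(lm(sales ~ TV + radio + newspaper, data = Advertising))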
#3. Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get ˆβ0 = 50, ˆβ1 = 20, ˆβ2 = 0.07, ˆβ3 = 35, ˆβ4 = 0.01, ˆβ5 = −10, so that Y = ˆβ0 + ˆβ1·GPA + ˆβ2·IQ + ˆβ3·Gender + ˆβ4·(GPA×IQ) + ˆβ5·(GPA×Gender). (a) Which answer is correct, and why? i. For a fixed value of IQ and GPA, males earn more on average than females. ii. For a fixed value of IQ and GPA, females earn more on average than males. iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough. iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.
#iii) is correct. For males (Gender = 0), the model is Y = 50 + 20·GPA + 0.07·IQ + 0.01·(GPA×IQ). For females (Gender = 1), it is Y = 50 + 35 + 20·GPA + 0.07·IQ + 0.01·(GPA×IQ) − 10·GPA = 85 + 10·GPA + 0.07·IQ + 0.01·(GPA×IQ). The female minus male difference is 35 − 10·GPA, which is negative whenever GPA > 3.5. So for a fixed IQ and GPA, males earn more on average than females provided the GPA is high enough.
#(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0. Y = 85 + 10·4 + 0.07·110 + 0.01·4·110 = 85 + 40 + 7.7 + 4.4 = 137.1, i.e. a predicted starting salary of $137,100.
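This can be checked with a short sketch in R (the salary() helper below is ours, not part of the exercise):
# Predicted salary (in $1000s) under the fitted model
salary <- function(gpa, iq, gender) {
  50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender
}
salary(4, 110, 1)                       # 137.1: the prediction in part (b)
salary(4, 110, 1) - salary(4, 110, 0)   # -5: at GPA = 4 > 3.5, males earn more, confirming (a)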
#(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer. False. The magnitude of a coefficient depends on the scale of the variables and says nothing by itself about the strength of the evidence; the p-value (equivalently, the t-statistic) of the interaction term is what determines whether the interaction is statistically significant.
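A small sketch with made-up data (names and numbers are ours) illustrates the point: rescaling a predictor changes the coefficient's magnitude 100-fold while leaving its t-statistic, and hence the evidence, unchanged.
# Rescaling IQ inflates the interaction coefficient but leaves the t-statistic intact
set.seed(3)  # arbitrary seed for this illustration
gpa <- runif(200, 2, 4); iq <- rnorm(200, 100, 15)
sal <- 50 + 20*gpa + 0.07*iq + 0.01*gpa*iq + rnorm(200)
fit1 <- lm(sal ~ gpa * iq)
fit2 <- lm(sal ~ gpa * I(iq/100))
coef(summary(fit1))["gpa:iq", ]          # small coefficient
coef(summary(fit2))["gpa:I(iq/100)", ]   # 100x larger coefficient, identical t value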
#6. Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (¯x, ¯y). From (3.4), the least squares intercept is ˆβ0 = ¯y − ˆβ1·¯x. The least squares line is Predicted Y = ˆβ0 + ˆβ1·x; evaluating it at x = ¯x gives Predicted Y = (¯y − ˆβ1·¯x) + ˆβ1·¯x = ¯y. So the least squares line always passes through the point (¯x, ¯y).
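A quick numerical sanity check of this fact, using arbitrary simulated data:
# The fitted line evaluated at mean(x) equals mean(y), up to floating-point error
set.seed(2)  # arbitrary seed for this check
xs <- rnorm(50); ys <- 1 + 3*xs + rnorm(50)
fit0 <- lm(ys ~ xs)
predict(fit0, data.frame(xs = mean(xs))) - mean(ys)  # essentially zero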
#8. This question involves the use of simple linear regression on the Auto data set. (a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example: i. Is there a relationship between the predictor and the response? ii. How strong is the relationship between the predictor and the response? iii. Is the relationship between the predictor and the response positive or negative? iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
setwd("/Users/tianchenxu/Desktop")
data <- read.csv("Auto.csv", header = TRUE, na.strings = "?")  # "?" marks missing values
Auto <- na.omit(data)  # drop the rows with missing values
summary(model<-lm(mpg~horsepower, data=Auto))
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i. Yes: the p-value for horsepower is < 2e-16, so there is strong evidence of a relationship between horsepower and mpg. ii. The relationship is moderately strong: R-squared = 0.6059, so horsepower explains about 61% of the variance in mpg. iii. Negative: the coefficient is -0.158, so mpg decreases as horsepower increases. iv. The predicted mpg at horsepower = 98 is 24.47; the 95% confidence and prediction intervals are computed below.
predict(model, data.frame(horsepower=c(98)), interval="confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(model, data.frame(horsepower=c(98)), interval="prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
The 95% prediction interval (14.81, 34.12) is much wider than the 95% confidence interval (23.97, 24.96) because it must cover the variability of an individual car's mpg, not just the uncertainty in the mean response.
#(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(Auto$horsepower, Auto$mpg)
abline(model)
#(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow=c(2,2))
plot(model)
Residuals vs Fitted: indicates potential non-linearity. Normal Q-Q: minor deviations suggest slight non-normality of the residuals. Scale-Location: a funnel shape suggests non-constant variance (heteroscedasticity). Residuals vs Leverage: a few points suggest potential undue influence on the model.
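The apparent non-linearity can be probed with a quadratic term; a quick sketch beyond what the exercise asks:
# Compare the linear fit with a quadratic fit; a small p-value supports the curvature seen above
quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
anova(model, quad)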
#11. In this problem we will investigate the t-statistic for the null hypothesis H0: β = 0 in simple linear regression without an intercept. (a) Perform a simple linear regression of y onto x without an intercept, and report the coefficient estimate, its standard error, and the t-statistic and p-value associated with the null hypothesis H0: β = 0. Comment on these results.
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
summary(lm(y~x+0))
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate is 1.9939, the standard error is 0.1065, and the t-statistic is 18.73. The p-value is < 2.2e-16, far below 0.05 and essentially 0, so we reject the null hypothesis H0: β = 0: there is a significant relationship between x and y.
#(b) Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-value associated with the null hypothesis H0: β = 0. Comment on these results.
summary(lm(x~y+0))
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate is 0.39111, the standard error is 0.02089, the t-statistic is 18.73, and the p-value is < 2.2e-16. The p-value is essentially 0, so we reject the null hypothesis H0: β = 0. Note that 0.39111/0.02089 = 18.72, the same t-statistic as in part (a).
#(c) What is the relationship between the results obtained in (a) and (b)? The t-statistics in (a) and (b) are identical: both regressions test the same linear association between x and y, only with the roles of predictor and response exchanged.
#(d) For the regression of Y onto X without an intercept, the t-statistic for H0: β = 0 takes the form ˆβ/SE(ˆβ), where ˆβ is given by (3.38). Show algebraically, and confirm numerically in R, that the t-statistic can be written as t = sqrt(n−1) · Σ(xi·yi) / sqrt(Σxi² · Σyi² − (Σ(xi·yi))²).
# Numerical confirmation: this reproduces the t-statistic reported by summary()
(sqrt(length(x)-1) * sum(x*y)) / (sqrt(sum(x*x) * sum(y*y) - (sum(x*y))^2))
## [1] 18.72593
#(e) Using the results from (d), argue that the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y. The expression in (d) is completely symmetric in x and y: swapping the two variables leaves every term unchanged, so the t-statistic must be the same in both regressions. Numerically, 1.9939/0.1065 and 0.39111/0.02089 both give 18.72, matching the reported t-statistic of 18.73 up to rounding. The strength of the linear relationship between the two variables is the same regardless of which one is treated as the response.
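Since the formula is symmetric, evaluating it with the roles of x and y exchanged reproduces the same number:
# Swapping x and y in the formula from (d) leaves every term unchanged
(sqrt(length(y)-1) * sum(y*x)) / (sqrt(sum(y*y) * sum(x*x) - (sum(y*x))^2))  # again 18.72593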
#f)In R, show that when regression is performed with an intercept, the t-statistic for H0: β1 = 0 is the same for the regression of y onto x as it is for the regression of x onto y.
# Linear regression of y onto x
model1 <- lm(y ~ x)
summary1 <- summary(model1)
t_stat_1 <- coef(summary1)["x", "t value"]
t_stat_1
## [1] 18.5556
# Linear regression of x onto y
model2 <- lm(x ~ y)
summary2 <- summary(model2)
t_stat_2 <- coef(summary2)["y", "t value"]
t_stat_2
## [1] 18.5556
#14. This problem focuses on the collinearity problem. (a) Perform the following commands in R; the last line creates a linear model in which y is a function of x1 and x2. (b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
set.seed(1)
x1 = runif(100)
x2 = 0.5*x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)
#The correlation between x1 and x2 is 0.8351, so the two predictors are highly collinear.
#(c) Fit a least squares regression to predict y using x1 and x2. What are ˆβ0, ˆβ1, and ˆβ2? Can you reject the null hypotheses H0: β1 = 0 and H0: β2 = 0?
summary(lm(y~x1+x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The fitted model is Y = 2.1305 + 1.4396·x1 + 1.0097·x2, i.e. ˆβ0 = 2.1305, ˆβ1 = 1.4396, and ˆβ2 = 1.0097. ˆβ0 is close to the true β0 = 2, but ˆβ1 underestimates the true β1 = 2 and ˆβ2 overestimates the true β2 = 0.3, a symptom of the collinearity. For the null hypothesis H0: β1 = 0, we reject it, though only barely, since p = 0.0487 < 0.05. For the null hypothesis H0: β2 = 0, we fail to reject, since p = 0.3754.
#(d) Now fit a least squares regression to predict y using only x1. Can you reject the null hypothesis H0: β1 = 0?
summary(lm(y~x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The p-value for x1 is 2.66e-06, far below 0.05, so we reject H0: β1 = 0.
#(e) Now fit a least squares regression to predict y using only x2. Can you reject the null hypothesis H0: β1 = 0?
summary(lm(y~x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
The p-value for x2 is 1.37e-05, essentially 0, so we also reject H0: β1 = 0.
#(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer. No. Because x1 and x2 are highly correlated, the model in (c) cannot separate their individual effects: the collinearity inflates the standard errors of the coefficients, so x2 appears insignificant even though it is related to y. When each predictor is fit on its own, as in (d) and (e), its relationship with y is clear. The variance inflation factor sketched below makes this concrete.
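A minimal sketch computing the variance inflation factor by hand, with no extra packages assumed:
# VIF = 1/(1 - R^2) from regressing one predictor on the other
r2 <- summary(lm(x1 ~ x2))$r.squared
1/(1 - r2)  # about 3.3 here (cor = 0.8351), confirming substantial collinearity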
#(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1=c(x1, 0.1)
x2=c(x2, 0.8)
y=c(y,6)
#Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
summary(lm(y~x1+x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
model1<-lm(y~x1+x2)
summary(lm(y~x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
model2<-lm(y~x1)
summary(lm(y~x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
model3<-lm(y~x2)
par(mfrow=c(2,2))
plot(model1)
plot(model2)
plot(model3)
Based on the diagnostic plots of model1 (y ~ x1 + x2), observation 101 (the new point) is a high-leverage point: its (x1, x2) pair of (0.1, 0.8) breaks the collinearity pattern of the other observations, and its influence flips which predictor appears significant (x2 is now significant while x1 no longer is). Its residual is not extreme, so it is not a clear outlier.
Based on the diagnostic plots of model2 (y ~ x1), observation 101 is an outlier but not a high-leverage point: it has the largest residual (3.57), while x1 = 0.1 lies well within the range of the other x1 values, so its leverage is modest.
Based on the diagnostic plots of model3 (y ~ x2), observation 101 is a high-leverage point but not an outlier: x2 = 0.8 lies beyond the rest of the x2 values, but its residual is unremarkable.
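As a numerical cross-check, a sketch using base R's influence measures:
# Leverage and studentized residual of the new observation (index 101) in each model
idx <- length(y)  # 101
for (m in list(model1, model2, model3)) {
  cat("leverage:", round(hatvalues(m)[idx], 3),
      " studentized residual:", round(rstudent(m)[idx], 3), "\n")
}
# Expect high leverage in model1 and model3; a large residual (an outlier) only in model2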