Q1. This question involves the use of simple linear regression on the Auto data set.
- Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output:
library(readxl)   # provides read_xlsx()
mydata <- read_xlsx("Auto.xlsx")
summary(mydata)
## mpg cylinders displacement
## Min. : 9.0 Min. :3.00 Min. : 68
## 1st Qu.:17.5 1st Qu.:4.00 1st Qu.:104
## Median :23.0 Median :4.00 Median :146
## Mean :23.5 Mean :5.46 Mean :194
## 3rd Qu.:29.0 3rd Qu.:8.00 3rd Qu.:262
## Max. :46.6 Max. :8.00 Max. :455
## horsepower weight acceleration
## Length:397 Min. :1613 Min. : 8.0
## Class :character 1st Qu.:2223 1st Qu.:13.8
## Mode :character Median :2800 Median :15.5
## Mean :2970 Mean :15.6
## 3rd Qu.:3609 3rd Qu.:17.1
## Max. :5140 Max. :24.8
## year origin
## Min. :70 Min. :1.00
## 1st Qu.:73 1st Qu.:1.00
## Median :76 Median :1.00
## Mean :76 Mean :1.57
## 3rd Qu.:79 3rd Qu.:2.00
## Max. :82 Max. :3.00
# Convert horsepower from character to numeric; non-numeric entries
# (recorded as "?" in the original Auto data) become NA, which is why
# 5 observations are later dropped from the regression
mydata$horsepower <- as.numeric(mydata$horsepower)
model <- lm(mpg ~ horsepower, data=mydata)
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.571 -3.259 -0.344 2.763 16.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.93586 0.71750 55.7 <2e-16 ***
## horsepower -0.15784 0.00645 -24.5 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.91 on 390 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.606, Adjusted R-squared: 0.605
## F-statistic: 600 on 1 and 390 DF, p-value: <2e-16
- Is there a relationship between the predictor and the response in the sample?
Yes. In the sample, the fitted slope for horsepower is -0.158 (t = -24.5) and \(R^2 = 0.606\), so horsepower alone accounts for about 61% of the variation in mpg; there is clearly a negative relationship between the predictor and the response in the sample.
- Is there a relationship between the predictor and the response in the population?
The scatterplot with the fitted regression line (below) shows a negative linear relationship between the predictor horsepower and the response mpg. From the regression summary, the p-value for the slope (< 2e-16) is below the 0.05 significance level, so we reject the null hypothesis and conclude that there is a linear relationship between horsepower and mpg in the population.
library(ggplot2)
ggplot(mydata, mapping = aes(x = horsepower, y=mpg)) + geom_point() + geom_smooth(method='lm', se=FALSE)
- Describe the relationship between the predictor and the response based on the slope.
The slope of the regression line is -0.15784, so the relationship between the predictor (horsepower) and the response (mpg) is negative: the model predicts that mpg decreases by about 0.158 for each additional unit of horsepower.
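A one-line check of this interpretation (a sketch, using the fitted model object from above):
10 * coef(model)["horsepower"]   # predicted change in mpg for 10 more horsepower: about -1.58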
- What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?
The predicted mpg associated with a horsepower of 98 is 24.5 (\(39.936 - 0.158 \times 98 \approx 24.47\)). The associated 95% confidence interval is (24, 25) and the associated 95% prediction interval is (14.8, 34.1).
new<-data.frame(horsepower=c(98))
predict(object= model, interval="confidence", level=0.95, se.fit = T, newdata=new )
## $fit
## fit lwr upr
## 1 24.5 24 25
##
## $se.fit
## [1] 0.251
##
## $df
## [1] 390
##
## $residual.scale
## [1] 4.91
predict(object= model, interval="prediction", level=0.95, se.fit=T, newdata=new)
## $fit
## fit lwr upr
## 1 24.5 14.8 34.1
##
## $se.fit
## [1] 0.251
##
## $df
## [1] 390
##
## $residual.scale
## [1] 4.91
- Plot the response and the predictor. Use the geom_smooth() function to display the least squares regression line.
library(ggplot2)
ggplot(mydata, mapping = aes(x = horsepower, y=mpg)) + geom_point() + geom_smooth(method='lm', se=FALSE)
- Plot the 95% confidence and prediction bands. Explain the difference between them (Which one is wider, and why?).
The 95% prediction band is wider than the 95% confidence band. The confidence interval only accounts for uncertainty in estimating the mean response at a given horsepower, while the prediction interval must also cover the scatter of an individual observation around that mean, which adds an extra mean square error (MSE) term to its variance.
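Concretely, the standard simple linear regression intervals at a new point \(x_0\) differ only by the extra MSE term under the square root (with \(S_{xx}=\sum_i (x_i-\bar{x})^2\)):
\(\begin{equation} \hat{y}_{0} \pm t_{\alpha/2,\,n-2}\sqrt{MSE\left(\frac{1}{n}+\frac{(x_{0}-\bar{x})^{2}}{S_{xx}}\right)} \text{ (confidence)}, \qquad \hat{y}_{0} \pm t_{\alpha/2,\,n-2}\sqrt{MSE\left(1+\frac{1}{n}+\frac{(x_{0}-\bar{x})^{2}}{S_{xx}}\right)} \text{ (prediction)} \end{equation}\)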
library(ggplot2)
pred <- predict(object= model, interval="prediction", level=0.95, newdata=mydata )
df_new <- cbind(mydata, pred)
head(df_new)
## mpg cylinders displacement horsepower weight acceleration
## 1 18 8 307 130 3504 12.0
## 2 15 8 350 165 3693 11.5
## 3 18 8 318 150 3436 11.0
## 4 16 8 304 150 3433 12.0
## 5 17 8 302 140 3449 10.5
## 6 15 8 429 198 4341 10.0
## year origin fit lwr upr
## 1 70 1 19.42 9.75 29.1
## 2 70 1 13.89 4.20 23.6
## 3 70 1 16.26 6.58 25.9
## 4 70 1 16.26 6.58 25.9
## 5 70 1 17.84 8.17 27.5
## 6 70 1 8.68 -1.05 18.4
ggplot(df_new, mapping = aes(x = horsepower, y=mpg)) + geom_point() + geom_smooth(method='lm', se=TRUE)+
geom_line(aes(x=horsepower, y=lwr), color="red", linetype="dashed")+
geom_line(aes(x=horsepower, y=upr), color="red", linetype="dashed")
Q2. Formulate the following regression function in matrix notation. If you plan to use LaTeX for the equations, it must be readable; unreadable answers will be assumed to be incorrect. If you answer the question in handwriting, please take a quality picture of your work and upload it. Illegible answers will be assumed to be incorrect.
\(\begin{equation} y_{i}=\beta_{0}+\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\varepsilon_{i}, \quad \text{for } i=1,\dots,n \end{equation}\)
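Stacking the \(n\) equations gives the matrix form \(\mathbf{y}=X\boldsymbol{\beta}+\boldsymbol{\varepsilon}\), where
\(\begin{equation} \mathbf{y}=\begin{pmatrix} y_{1}\\ \vdots\\ y_{n}\end{pmatrix},\quad X=\begin{pmatrix} 1 & x_{1,1} & x_{1,2}\\ \vdots & \vdots & \vdots\\ 1 & x_{n,1} & x_{n,2}\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix} \beta_{0}\\ \beta_{1}\\ \beta_{2}\end{pmatrix},\quad \boldsymbol{\varepsilon}=\begin{pmatrix} \varepsilon_{1}\\ \vdots\\ \varepsilon_{n}\end{pmatrix} \end{equation}\)
so \(\mathbf{y}\) is \(n\times 1\), \(X\) is \(n\times 3\), \(\boldsymbol{\beta}\) is \(3\times 1\), and \(\boldsymbol{\varepsilon}\) is \(n\times 1\).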
Q3. This question involves the use of multiple linear regression on the Credit data set.
- Fit a multiple regression model relating balance (average credit card debt for each individual) to age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating).
mydata3 <- read_xlsx("Credit.xlsx")
head(mydata3)
## # A tibble: 6 x 11
## Income Limit Rating Cards Age Education Gender Student
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 14.9 3606 283 2 34 11 Male No
## 2 106. 6645 483 3 82 15 Female Yes
## 3 105. 7075 514 4 71 11 Male No
## 4 149. 9504 681 3 36 11 Female No
## 5 55.9 4897 357 2 68 16 Male No
## 6 80.2 8047 569 4 77 10 Male No
## # … with 3 more variables: Married <chr>, Ethnicity <chr>,
## # Balance <dbl>
model <- lm(Balance ~ Age + Cards + Education + Income + Limit + Rating, data=mydata3, x=T)
summary(model)
##
## Call:
## lm(formula = Balance ~ Age + Cards + Education + Income + Limit +
## Rating, data = mydata3, x = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.2 -113.2 -42.1 45.8 543.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -477.958 55.065 -8.68 <2e-16 ***
## Age -0.892 0.478 -1.87 0.0627 .
## Cards 11.592 7.067 1.64 0.1017
## Education 1.998 2.600 0.77 0.4426
## Income -7.558 0.382 -19.77 <2e-16 ***
## Limit 0.126 0.053 2.37 0.0181 *
## Rating 2.063 0.794 2.60 0.0097 **
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162 on 393 degrees of freedom
## Multiple R-squared: 0.878, Adjusted R-squared: 0.876
## F-statistic: 472 on 6 and 393 DF, p-value: <2e-16
- For your model in part a (suppose we write it as \(Y=X\beta+\epsilon\)), find \(X\), \(Y\), and MSE
X<-model$x
head(X)
## (Intercept) Age Cards Education Income Limit Rating
## 1 1 34 2 11 14.9 3606 283
## 2 1 82 3 15 106.0 6645 483
## 3 1 71 4 11 104.6 7075 514
## 4 1 36 3 11 148.9 9504 681
## 5 1 68 2 16 55.9 4897 357
## 6 1 77 4 10 80.2 8047 569
Y<-mydata3$Balance
Y
## [1] 333 903 580 964 331 1151 203 872 279 1350 1407
## [12] 0 204 1081 148 0 0 368 891 1048 89 968
## [23] 0 411 0 671 654 467 1809 915 863 0 526
## [34] 0 0 419 762 1093 531 344 50 1155 385 976
## [45] 1120 997 1241 797 0 902 654 211 607 957 0
## [56] 0 379 133 333 531 631 108 0 133 0 602
## [67] 1388 889 822 1084 357 1103 663 601 945 29 532
## [78] 145 391 0 162 99 503 0 0 1779 815 0
## [89] 579 1176 1023 812 0 937 0 0 1380 155 375
## [100] 1311 298 431 1587 1050 745 210 0 0 227 297
## [111] 47 0 1046 768 271 510 0 1341 0 0 0
## [122] 454 904 0 0 0 1404 0 1259 255 868 0
## [133] 912 1018 835 8 75 187 0 1597 1425 605 669
## [144] 710 68 642 805 0 0 0 581 534 156 0
## [155] 0 0 429 1020 653 0 836 0 1086 0 548
## [166] 570 0 0 0 1099 0 283 108 724 1573 0
## [177] 0 384 453 1237 423 516 789 0 1448 450 188
## [188] 0 930 126 538 1687 336 1426 0 802 749 69
## [199] 0 571 829 1048 0 1411 456 638 0 1216 230
## [210] 732 95 799 308 637 681 246 52 955 195 653
## [221] 1246 1230 1549 573 701 1075 1032 482 156 1058 661
## [232] 657 689 0 1329 191 489 443 52 163 148 0
## [243] 16 856 0 0 199 0 0 98 0 132 1355
## [254] 218 1048 118 0 0 0 1092 345 1050 465 133
## [265] 651 549 15 942 0 772 136 436 728 1255 967
## [276] 529 209 531 250 269 541 0 1298 890 0 0
## [287] 0 0 863 485 159 309 481 1677 0 0 293
## [298] 188 0 711 580 172 295 414 905 0 70 0
## [309] 681 885 1036 844 823 843 1140 463 1142 136 0
## [320] 0 5 81 265 1999 415 732 1361 984 121 846
## [331] 1054 474 380 182 594 194 926 0 606 1107 320
## [342] 426 204 410 633 0 907 1192 0 503 0 302
## [353] 583 425 413 1405 962 0 347 611 712 382 710
## [364] 578 1243 790 1264 216 345 1208 992 0 840 1003
## [375] 588 1000 767 0 717 0 661 849 1352 382 0
## [386] 905 371 0 1129 806 1393 721 0 0 734 560
## [397] 480 138 0 966
MSE <- sum(residuals(model)^2) / model$df.residual
MSE
## [1] 26129
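As a quick cross-check, the MSE should equal the square of lm's residual standard error (a sketch using the built-in summary object):
summary(model)$sigma^2   # residual standard error squared; about 26129, matching MSE above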
- Find the least squares estimate \(b\) from \(X\) and \(Y\). How does this agree with your results in part a?
b0 = -477.958, b1 = -0.892, b2 = 11.592, b3 = 1.998, b4 = -7.558, b5 = 0.126, b6 = 2.063. These values match the coefficients from summary(model) exactly, as expected: lm() computes the same least squares solution \(b=(X'X)^{-1}X'Y\).
t(X) %*% X
## (Intercept) Age Cards Education Income
## (Intercept) 400 2.23e+04 1183 5380 1.81e+04
## Age 22267 1.36e+06 66260 299569 1.05e+06
## Cards 1183 6.63e+04 4249 15824 5.31e+04
## Education 5380 3.00e+05 15824 76258 2.42e+05
## Income 18088 1.05e+06 53142 242061 1.31e+06
## Limit 1894240 1.07e+08 5615136 25409750 1.11e+08
## Rating 141976 8.01e+06 424401 1903763 8.14e+06
## Limit Rating
## (Intercept) 1.89e+06 1.42e+05
## Age 1.07e+08 8.01e+06
## Cards 5.62e+06 4.24e+05
## Education 2.54e+07 1.90e+06
## Income 1.11e+08 8.14e+06
## Limit 1.11e+10 8.14e+08
## Rating 8.14e+08 5.99e+07
t(X) %*% Y
## [,1]
## (Intercept) 2.08e+05
## Age 1.16e+07
## Cards 6.37e+05
## Education 2.79e+06
## Income 1.24e+07
## Limit 1.35e+09
## Rating 9.83e+07
b_LeastSquareEstimation<-solve(t(X) %*% X) %*% (t(X) %*% Y)
b_LeastSquareEstimation
## [,1]
## (Intercept) -477.958
## Age -0.892
## Cards 11.592
## Education 1.998
## Income -7.558
## Limit 0.126
## Rating 2.063
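For reference, a numerically more stable way to compute the same estimate is to solve the normal equations directly rather than inverting \(X'X\) (a sketch; crossprod(X) computes \(X'X\)):
b_alt <- solve(crossprod(X), crossprod(X, Y))   # solves (X'X) b = X'Y; same values as above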
- Estimate \(\sigma^2\) and \(\sigma\)
sum(residuals(model)^2) / model$df.residual   # estimate of sigma^2 (the MSE)
## [1] 26129
sqrt(sum(residuals(model)^2) / model$df.residual)   # estimate of sigma
## [1] 162
- Find the variance-covariance matrix of the regression parameters.
XTXinv <- solve(t(X) %*% X)
MSE*XTXinv
## (Intercept) Age Cards Education
## (Intercept) 3032.19 -1.16e+01 -20.277 -96.72525
## Age -11.64 2.29e-01 -0.139 -0.01381
## Cards -20.28 -1.39e-01 49.938 0.13136
## Education -96.73 -1.38e-02 0.131 6.75889
## Income 3.26 -2.91e-02 0.219 0.01378
## Limit 1.07 2.12e-04 0.202 -0.00914
## Rating -17.42 -4.91e-04 -3.062 0.13765
## Income Limit Rating
## (Intercept) 3.262661 1.071959 -1.74e+01
## Age -0.029054 0.000212 -4.91e-04
## Cards 0.218834 0.201771 -3.06e+00
## Education 0.013777 -0.009140 1.38e-01
## Income 0.146207 -0.000333 -2.12e-02
## Limit -0.000333 0.002813 -4.19e-02
## Rating -0.021164 -0.041876 6.31e-01
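This matrix can be verified against R's built-in extractor:
vcov(model)   # should reproduce MSE * (X'X)^{-1} computed above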
- Find standard errors of the regression parameters.
sqrt(MSE*diag(XTXinv))
## (Intercept) Age Cards Education Income
## 55.065 0.478 7.067 2.600 0.382
## Limit Rating
## 0.053 0.794
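These standard errors match the "Std. Error" column of the regression summary, which can be confirmed with:
coef(summary(model))[, "Std. Error"]   # should match the values above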
- Visualize the correlation matrix of regression parameters. What can you say about the correlation between Rating and Limit?
The coefficient estimates for Rating and Limit are very strongly negatively correlated: from the variance-covariance matrix above, \(-0.0419/\sqrt{0.00281 \times 0.631} \approx -0.99\). Because the predictors Limit and Rating are themselves nearly collinear, overestimating one coefficient is compensated by underestimating the other.
library(corrplot)
corrplot(cov2cor(MSE*XTXinv))
- Find b2 = \((X'X+I)^{-1} X'Y\). (Generate \(I\) as I <- diag(dim(t(X) %*% X)[1]) ). Does this agree with your results in part a?
No, it does not agree with the results in part a: the values differ from the coefficients in summary(model). Adding \(I\) to \(X'X\) yields the ridge regression estimator with penalty \(\lambda=1\), which shrinks the least squares coefficients, so an exact match is not expected.
I <- diag(dim(t(X) %*% X)[1])
b2 <- solve(t(X) %*% X + I) %*% (t(X) %*% Y)
b2
## [,1]
## (Intercept) -428.250
## Age -1.083
## Cards 11.238
## Education 0.413
## Income -7.505
## Limit 0.143
## Rating 1.779
Q4. This question involves the use of multiple linear regression on the Auto data set.
- Produce a scatterplot matrix which includes all of the variables in the data set (hint: you may need to summarize the dataset with summary(). Make sure all the variables are coded as quantitative variables; you may need to change the non-quantitative variables into quantitative ones with as.numeric()).
library(GGally)   # provides ggpairs()
mydata4<- read_xlsx("Auto.xlsx")
head(mydata4)
## # A tibble: 6 x 8
## mpg cylinders displacement horsepower weight
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 18 8 307 130 3504
## 2 15 8 350 165 3693
## 3 18 8 318 150 3436
## 4 16 8 304 150 3433
## 5 17 8 302 140 3449
## 6 15 8 429 198 4341
## # … with 3 more variables: acceleration <dbl>, year <dbl>,
## # origin <dbl>
# Convert horsepower from character to numeric (as in Q1)
mydata4$horsepower <- as.numeric(mydata4$horsepower)
ggpairs(mydata4)
- Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables as the predictors. Use the summary() function to print the results. Comment on the output:
model <- lm(mpg ~ cylinders + displacement + horsepower +weight + acceleration + origin + year, data=mydata4, x=T)
summary(model)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + origin + year, data = mydata4, x = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.590 -2.157 -0.117 1.869 13.060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.72e+01 4.64e+00 -3.71 0.00024 ***
## cylinders -4.93e-01 3.23e-01 -1.53 0.12780
## displacement 1.99e-02 7.51e-03 2.65 0.00844 **
## horsepower -1.70e-02 1.38e-02 -1.23 0.21963
## weight -6.47e-03 6.52e-04 -9.93 < 2e-16 ***
## acceleration 8.06e-02 9.88e-02 0.82 0.41548
## origin 1.43e+00 2.78e-01 5.13 4.7e-07 ***
## year 7.51e-01 5.10e-02 14.73 < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.33 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.821, Adjusted R-squared: 0.818
## F-statistic: 252 on 7 and 384 DF, p-value: <2e-16
- What percentage of the variation in the response is reduced by taking into account the predictors?
The Multiple R-squared is 0.821 and the Adjusted R-squared is 0.818, so about 82.1% of the variation in the response is reduced by taking the predictors into account.
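This is exactly what \(R^2\) measures: \(\begin{equation} R^{2} = \frac{SSTO-SSE}{SSTO} = 1-\frac{SSE}{SSTO} \end{equation}\) where \(SSE\) is the error sum of squares and \(SSTO\) the total sum of squares, i.e., the proportional reduction in the total variation of the response achieved by the regression.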
- For which of the predictors can you reject the null hypothesis \(H_0 :\beta_{j} =0\)?
We can reject the null hypothesis for the predictors whose p-values are below 0.05: displacement, weight, origin, and year.
- What does the coefficient for the year variable suggest?
The year variable has a coefficient of 7.51e-01, i.e., about 0.75. The positive sign indicates that, holding the other predictors fixed, mean mpg tends to increase by about 0.75 for each additional model year; in other words, newer cars are more fuel efficient.
- On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which you can reject the null hypothesis \(H_0 :\beta_{j} =0\).
model1 <- lm(formula = mpg ~ displacement + weight + origin + year, data = mydata4, x = T)
summary(model1)
##
## Call:
## lm(formula = mpg ~ displacement + weight + origin + year, data = mydata4,
## x = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.831 -2.119 -0.008 1.807 13.143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.88e+01 3.99e+00 -4.72 3.3e-06 ***
## displacement 6.09e-03 4.76e-03 1.28 0.2
## weight -6.66e-03 5.56e-04 -11.98 < 2e-16 ***
## origin 1.23e+00 2.65e-01 4.65 4.5e-06 ***
## year 7.76e-01 4.95e-02 15.70 < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.35 on 392 degrees of freedom
## Multiple R-squared: 0.819, Adjusted R-squared: 0.817
## F-statistic: 443 on 4 and 392 DF, p-value: <2e-16
- Find the adjusted \(R^2\) for the model in part (c). Compare this value to the adjusted \(R^2\) for the model in part (a). Discuss your results.
The adjusted \(R^2\) for the model in part (c) is 0.817, only 0.001 below the 0.818 of the full model in part (a). The two models explain essentially the same amount of variation, so dropping the non-significant predictors costs almost no explanatory power.
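A formal comparison of the two models is possible with a partial F-test (a sketch; both models must be fit to the same rows, and the full model in part (a) dropped 5 rows with missing horsepower, so we refit both on complete cases):
complete <- na.omit(mydata4)   # keep complete cases so the models are nested on the same data
full <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + origin + year, data = complete)
reduced <- lm(mpg ~ displacement + weight + origin + year, data = complete)
anova(reduced, full)   # tests whether the dropped predictors jointly improve the fit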
- Using the model from (c), obtain 95% confidence intervals for the coefficient(s).
The 95% confidence interval for displacement is (-0.00326, 0.01544), for weight (-0.00775, -0.00556), for origin (0.71286, 1.75622), and for year (0.67922, 0.87368).
confint(object=model1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -26.68915 -10.99282
## displacement -0.00326 0.01544
## weight -0.00775 -0.00556
## origin 0.71286 1.75622
## year 0.67922 0.87368
Q5. The weight and systolic blood pressure of 26 randomly selected males in the age group 25–30 are shown in SBP.xlsx.
- Estimate and explain the correlation coefficient.
The estimated correlation coefficient is 0.773. The positive sign of r tells us the relationship is positive: as weight increases, systolic blood pressure tends to increase as well. The magnitude indicates a fairly strong linear association.
mydata5 <- read_xlsx("SBP.xlsx")
head(mydata5)
## # A tibble: 6 x 2
## weight sbp
## <dbl> <dbl>
## 1 165 130
## 2 167 133
## 3 180 150
## 4 155 128
## 5 212 151
## 6 175 146
cor(mydata5$weight, mydata5$sbp)
## [1] 0.773
- Test the hypothesis that \(\rho\) = 0.
The p-value (4e-06) is less than the significance level of 0.05, so we reject the null hypothesis. There is sufficient statistical evidence at the 0.05 level to conclude that there is a significant linear relationship between weight and systolic blood pressure.
cor.test(mydata5$weight, mydata5$sbp)
##
## Pearson's product-moment correlation
##
## data: mydata5$weight and mydata5$sbp
## t = 6, df = 24, p-value = 4e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.551 0.893
## sample estimates:
## cor
## 0.773
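The reported t statistic can be reproduced by hand from the standard formula \(t = r\sqrt{n-2}/\sqrt{1-r^{2}}\) (a sketch):
r <- cor(mydata5$weight, mydata5$sbp)
n <- nrow(mydata5)
r * sqrt(n - 2) / sqrt(1 - r^2)   # about 5.97, matching the rounded t = 6 above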
- Find a 95% CI for \(\rho\).
The 95% CI is (0.551, 0.893)
- Test the hypothesis that \(\rho\) = 0.6 (at \(\alpha = 0.05\) level).
From the results above, the 95% CI for the correlation \(\rho\) is (0.551, 0.893). Since 0.6 falls inside this interval, we fail to reject \(H_0: \rho = 0.6\) at the \(\alpha = 0.05\) level.
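The same conclusion follows from a direct test using Fisher's z-transformation (a sketch; assumes the n = 26 pairs from SBP.xlsx loaded above):
# Fisher z-test of H0: rho = 0.6, two-sided
r <- cor(mydata5$weight, mydata5$sbp)
n <- nrow(mydata5)
z <- (atanh(r) - atanh(0.6)) * sqrt(n - 3)   # approximately N(0,1) under H0
2 * pnorm(-abs(z))                           # p-value around 0.11 > 0.05, so fail to reject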