Q1. This question involves the use of simple linear regression on the Auto data set.

  a. Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output:
library(readxl)   # for read_xlsx()

mydata <- read_xlsx("Auto.xlsx")

summary(mydata)
##       mpg         cylinders     displacement
##  Min.   : 9.0   Min.   :3.00   Min.   : 68  
##  1st Qu.:17.5   1st Qu.:4.00   1st Qu.:104  
##  Median :23.0   Median :4.00   Median :146  
##  Mean   :23.5   Mean   :5.46   Mean   :194  
##  3rd Qu.:29.0   3rd Qu.:8.00   3rd Qu.:262  
##  Max.   :46.6   Max.   :8.00   Max.   :455  
##   horsepower            weight      acceleration 
##  Length:397         Min.   :1613   Min.   : 8.0  
##  Class :character   1st Qu.:2223   1st Qu.:13.8  
##  Mode  :character   Median :2800   Median :15.5  
##                     Mean   :2970   Mean   :15.6  
##                     3rd Qu.:3609   3rd Qu.:17.1  
##                     Max.   :5140   Max.   :24.8  
##       year        origin    
##  Min.   :70   Min.   :1.00  
##  1st Qu.:73   1st Qu.:1.00  
##  Median :76   Median :1.00  
##  Mean   :76   Mean   :1.57  
##  3rd Qu.:79   3rd Qu.:2.00  
##  Max.   :82   Max.   :3.00
# Change horsepower into a quantitative variable; entries that cannot be
# parsed as numbers become NA (these are the 5 observations lm() later drops)
mydata$horsepower <- as.numeric(mydata$horsepower)

model <- lm(mpg ~ horsepower, data=mydata)
summary(model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.571  -3.259  -0.344   2.763  16.924 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.93586    0.71750    55.7   <2e-16 ***
## horsepower  -0.15784    0.00645   -24.5   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.91 on 390 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.606,  Adjusted R-squared:  0.605 
## F-statistic:  600 on 1 and 390 DF,  p-value: <2e-16
  b. Is there a relationship between the predictor and the response in the sample?

Yes. The fitted slope (-0.158) is clearly nonzero, and horsepower accounts for about 60.6% of the sample variation in mpg (Multiple R-squared = 0.606), so the sample shows a strong negative relationship between horsepower and mpg.
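
A direct sample summary of this relationship is the sample correlation between mpg and horsepower; a quick check (use = "complete.obs" drops the rows where horsepower is NA):

cor(mydata$mpg, mydata$horsepower, use = "complete.obs")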

  c. Is there a relationship between the predictor and the response in the population?

The regression plot below shows a negative linear relationship between the predictor horsepower and the response mpg. From the regression summary, the p-value for the horsepower coefficient (< 2e-16) is below the 0.05 significance level, so we reject the null hypothesis \(H_0: \beta_1 = 0\) and conclude that there is a linear relationship between horsepower and mpg in the population.

library(ggplot2)
ggplot(mydata, aes(x = horsepower, y = mpg)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

  d. Describe the relationship between the predictor and the response based on the slope.

The slope of the regression line is -0.15784, so the relationship between the predictor (horsepower) and the response (mpg) is negative: the model predicts that mpg decreases by about 0.158 for each additional unit of horsepower.

  e. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

The predicted mpg for a horsepower of 98 is about 24.5 (39.936 - 0.158 x 98 = 24.47). The associated 95% confidence interval is (24.0, 25.0) and the 95% prediction interval is (14.8, 34.1).

new <- data.frame(horsepower = 98)

predict(object = model, interval = "confidence", level = 0.95, se.fit = TRUE, newdata = new)
## $fit
##    fit lwr upr
## 1 24.5  24  25
## 
## $se.fit
## [1] 0.251
## 
## $df
## [1] 390
## 
## $residual.scale
## [1] 4.91
predict(object = model, interval = "prediction", level = 0.95, se.fit = TRUE, newdata = new)
## $fit
##    fit  lwr  upr
## 1 24.5 14.8 34.1
## 
## $se.fit
## [1] 0.251
## 
## $df
## [1] 390
## 
## $residual.scale
## [1] 4.91
  f. Plot the response and the predictor. Use the geom_smooth() function to display the least squares regression line.
library(ggplot2)
ggplot(mydata, aes(x = horsepower, y = mpg)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

  g. Plot the 95% confidence and prediction bands. Explain the difference between them (Which one is wider, and why?).

The 95% prediction band is wider than the 95% confidence band. The confidence band reflects only the uncertainty in estimating the mean response at each horsepower value, while the prediction band must also account for the scatter of individual observations around that mean, which adds an extra MSE term to its variance.
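
Concretely, for simple linear regression both intervals at a new value \(x_0\) are centered at the fitted value \(\hat{y}_0\) and differ only by an extra MSE term under the square root:

\[ \hat{y}_0 \pm t_{\alpha/2,\,n-2}\sqrt{\mathrm{MSE}\left(\tfrac{1}{n}+\tfrac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}\right)} \quad \text{(confidence)} \]

\[ \hat{y}_0 \pm t_{\alpha/2,\,n-2}\sqrt{\mathrm{MSE}\left(1+\tfrac{1}{n}+\tfrac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}\right)} \quad \text{(prediction)} \]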

library(ggplot2)

# Prediction interval at every observed horsepower value
pred <- predict(object = model, interval = "prediction", level = 0.95, newdata = mydata)

df_new <- cbind(mydata, pred)
head(df_new)
##   mpg cylinders displacement horsepower weight acceleration
## 1  18         8          307        130   3504         12.0
## 2  15         8          350        165   3693         11.5
## 3  18         8          318        150   3436         11.0
## 4  16         8          304        150   3433         12.0
## 5  17         8          302        140   3449         10.5
## 6  15         8          429        198   4341         10.0
##   year origin   fit   lwr  upr
## 1   70      1 19.42  9.75 29.1
## 2   70      1 13.89  4.20 23.6
## 3   70      1 16.26  6.58 25.9
## 4   70      1 16.26  6.58 25.9
## 5   70      1 17.84  8.17 27.5
## 6   70      1  8.68 -1.05 18.4
ggplot(df_new, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +  # shaded ribbon: 95% confidence band
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +  # dashed lines: 95% prediction band
  geom_line(aes(y = upr), color = "red", linetype = "dashed")

Q2. Formulate the following regression function in matrix notation. If you plan to use LaTeX for the equations, it must be readable; unreadable answers will be assumed to be incorrect. If you answer the question in handwriting, please take a quality picture of your work and upload it. Illegible answers will be assumed to be incorrect.

\(y_{i}=\beta_{0}+\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\varepsilon_{i}, \quad \text{for } i=1,\ldots,n\)
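
In matrix notation, stacking the \(n\) observations gives \(Y = X\beta + \varepsilon\), where

\[ Y=\begin{bmatrix} y_{1}\\ y_{2}\\ \vdots\\ y_{n}\end{bmatrix},\quad X=\begin{bmatrix} 1 & x_{1,1} & x_{1,2}\\ 1 & x_{2,1} & x_{2,2}\\ \vdots & \vdots & \vdots\\ 1 & x_{n,1} & x_{n,2}\end{bmatrix},\quad \beta=\begin{bmatrix} \beta_{0}\\ \beta_{1}\\ \beta_{2}\end{bmatrix},\quad \varepsilon=\begin{bmatrix} \varepsilon_{1}\\ \varepsilon_{2}\\ \vdots\\ \varepsilon_{n}\end{bmatrix} \]

so that \(Y\) is \(n \times 1\), \(X\) is \(n \times 3\), \(\beta\) is \(3 \times 1\), and \(\varepsilon\) is \(n \times 1\).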

Q3. This question involves the use of multiple linear regression on the Credit data set.

  a. Fit a multiple regression model relating balance (average credit card debt for each individual) to age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating).
mydata3 <- read_xlsx("Credit.xlsx")
head(mydata3)
## # A tibble: 6 x 11
##   Income Limit Rating Cards   Age Education Gender Student
##    <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl> <chr>  <chr>  
## 1   14.9  3606    283     2    34        11 Male   No     
## 2  106.   6645    483     3    82        15 Female Yes    
## 3  105.   7075    514     4    71        11 Male   No     
## 4  149.   9504    681     3    36        11 Female No     
## 5   55.9  4897    357     2    68        16 Male   No     
## 6   80.2  8047    569     4    77        10 Male   No     
## # … with 3 more variables: Married <chr>, Ethnicity <chr>,
## #   Balance <dbl>
model <- lm(Balance ~ Age + Cards + Education + Income + Limit + Rating, data=mydata3, x=T)
summary(model)
## 
## Call:
## lm(formula = Balance ~ Age + Cards + Education + Income + Limit + 
##     Rating, data = mydata3, x = T)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -227.2 -113.2  -42.1   45.8  543.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -477.958     55.065   -8.68   <2e-16 ***
## Age           -0.892      0.478   -1.87   0.0627 .  
## Cards         11.592      7.067    1.64   0.1017    
## Education      1.998      2.600    0.77   0.4426    
## Income        -7.558      0.382  -19.77   <2e-16 ***
## Limit          0.126      0.053    2.37   0.0181 *  
## Rating         2.063      0.794    2.60   0.0097 ** 
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162 on 393 degrees of freedom
## Multiple R-squared:  0.878,  Adjusted R-squared:  0.876 
## F-statistic:  472 on 6 and 393 DF,  p-value: <2e-16
  b. For your model in part a (suppose we write it as \(Y=X\beta+\epsilon\)), find \(X\), \(Y\), and the MSE.
X<-model$x
head(X)
##   (Intercept) Age Cards Education Income Limit Rating
## 1           1  34     2        11   14.9  3606    283
## 2           1  82     3        15  106.0  6645    483
## 3           1  71     4        11  104.6  7075    514
## 4           1  36     3        11  148.9  9504    681
## 5           1  68     2        16   55.9  4897    357
## 6           1  77     4        10   80.2  8047    569
Y<-mydata3$Balance
Y
##   [1]  333  903  580  964  331 1151  203  872  279 1350 1407
##  [12]    0  204 1081  148    0    0  368  891 1048   89  968
##  [23]    0  411    0  671  654  467 1809  915  863    0  526
##  [34]    0    0  419  762 1093  531  344   50 1155  385  976
##  [45] 1120  997 1241  797    0  902  654  211  607  957    0
##  [56]    0  379  133  333  531  631  108    0  133    0  602
##  [67] 1388  889  822 1084  357 1103  663  601  945   29  532
##  [78]  145  391    0  162   99  503    0    0 1779  815    0
##  [89]  579 1176 1023  812    0  937    0    0 1380  155  375
## [100] 1311  298  431 1587 1050  745  210    0    0  227  297
## [111]   47    0 1046  768  271  510    0 1341    0    0    0
## [122]  454  904    0    0    0 1404    0 1259  255  868    0
## [133]  912 1018  835    8   75  187    0 1597 1425  605  669
## [144]  710   68  642  805    0    0    0  581  534  156    0
## [155]    0    0  429 1020  653    0  836    0 1086    0  548
## [166]  570    0    0    0 1099    0  283  108  724 1573    0
## [177]    0  384  453 1237  423  516  789    0 1448  450  188
## [188]    0  930  126  538 1687  336 1426    0  802  749   69
## [199]    0  571  829 1048    0 1411  456  638    0 1216  230
## [210]  732   95  799  308  637  681  246   52  955  195  653
## [221] 1246 1230 1549  573  701 1075 1032  482  156 1058  661
## [232]  657  689    0 1329  191  489  443   52  163  148    0
## [243]   16  856    0    0  199    0    0   98    0  132 1355
## [254]  218 1048  118    0    0    0 1092  345 1050  465  133
## [265]  651  549   15  942    0  772  136  436  728 1255  967
## [276]  529  209  531  250  269  541    0 1298  890    0    0
## [287]    0    0  863  485  159  309  481 1677    0    0  293
## [298]  188    0  711  580  172  295  414  905    0   70    0
## [309]  681  885 1036  844  823  843 1140  463 1142  136    0
## [320]    0    5   81  265 1999  415  732 1361  984  121  846
## [331] 1054  474  380  182  594  194  926    0  606 1107  320
## [342]  426  204  410  633    0  907 1192    0  503    0  302
## [353]  583  425  413 1405  962    0  347  611  712  382  710
## [364]  578 1243  790 1264  216  345 1208  992    0  840 1003
## [375]  588 1000  767    0  717    0  661  849 1352  382    0
## [386]  905  371    0 1129  806 1393  721    0    0  734  560
## [397]  480  138    0  966
MSE <- sum(residuals(model)^2) / model$df.residual 
MSE
## [1] 26129
  c. Find the least squares estimate \(b\) from \(X\) and \(Y\). How does this agree with your results in part a?

From the normal equations, \(b=(X'X)^{-1}X'Y\), computed step by step below. The result is b0 = -477.958, b1 = -0.892, b2 = 11.592, b3 = 1.998, b4 = -7.558, b5 = 0.126, b6 = 2.063, which matches the coefficients reported by summary(model) in part a.

t(X) %*% X
##             (Intercept)      Age   Cards Education   Income
## (Intercept)         400 2.23e+04    1183      5380 1.81e+04
## Age               22267 1.36e+06   66260    299569 1.05e+06
## Cards              1183 6.63e+04    4249     15824 5.31e+04
## Education          5380 3.00e+05   15824     76258 2.42e+05
## Income            18088 1.05e+06   53142    242061 1.31e+06
## Limit           1894240 1.07e+08 5615136  25409750 1.11e+08
## Rating           141976 8.01e+06  424401   1903763 8.14e+06
##                Limit   Rating
## (Intercept) 1.89e+06 1.42e+05
## Age         1.07e+08 8.01e+06
## Cards       5.62e+06 4.24e+05
## Education   2.54e+07 1.90e+06
## Income      1.11e+08 8.14e+06
## Limit       1.11e+10 8.14e+08
## Rating      8.14e+08 5.99e+07
t(X) %*% Y
##                 [,1]
## (Intercept) 2.08e+05
## Age         1.16e+07
## Cards       6.37e+05
## Education   2.79e+06
## Income      1.24e+07
## Limit       1.35e+09
## Rating      9.83e+07
b_LeastSquareEstimation <- solve(t(X) %*% X) %*% (t(X) %*% Y)
b_LeastSquareEstimation
##                 [,1]
## (Intercept) -477.958
## Age           -0.892
## Cards         11.592
## Education      1.998
## Income        -7.558
## Limit          0.126
## Rating         2.063
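
As a numerical aside, explicitly inverting \(X'X\) with solve() can be unstable when predictors are highly correlated (as Limit and Rating are here). A sketch of an equivalent QR-based computation, which returns the same least squares solution without forming the inverse:

# For a rectangular X, qr.solve() returns the least squares solution
# of X b = Y via a QR decomposition, avoiding solve(t(X) %*% X)
qr.solve(X, Y)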
  d. Estimate \(\sigma^2\) and \(\sigma\)
sum(residuals(model)^2) / model$df.residual
## [1] 26129
sqrt(sum(residuals(model)^2) / model$df.residual)
## [1] 162
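
Both estimates can also be read off the fitted model; sigma() extracts the residual standard error \(\hat{\sigma}\) from an lm object, so squaring it recovers the MSE:

# Built-in cross-check: should reproduce the 162 and 26129 above
sigma(model)
sigma(model)^2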
  e. Find the variance-covariance matrix of the regression parameters.
XTXinv <- solve(t(X) %*% X)

MSE*XTXinv
##             (Intercept)       Age   Cards Education
## (Intercept)     3032.19 -1.16e+01 -20.277 -96.72525
## Age              -11.64  2.29e-01  -0.139  -0.01381
## Cards            -20.28 -1.39e-01  49.938   0.13136
## Education        -96.73 -1.38e-02   0.131   6.75889
## Income             3.26 -2.91e-02   0.219   0.01378
## Limit              1.07  2.12e-04   0.202  -0.00914
## Rating           -17.42 -4.91e-04  -3.062   0.13765
##                Income     Limit    Rating
## (Intercept)  3.262661  1.071959 -1.74e+01
## Age         -0.029054  0.000212 -4.91e-04
## Cards        0.218834  0.201771 -3.06e+00
## Education    0.013777 -0.009140  1.38e-01
## Income       0.146207 -0.000333 -2.12e-02
## Limit       -0.000333  0.002813 -4.19e-02
## Rating      -0.021164 -0.041876  6.31e-01
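
The same matrix is available directly from the fitted model object, which gives a sanity check on the manual computation:

# For an lm fit, vcov() returns MSE * (X'X)^{-1}
vcov(model)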
  f. Find standard errors of the regression parameters.
sqrt(MSE*diag(XTXinv))
## (Intercept)         Age       Cards   Education      Income 
##      55.065       0.478       7.067       2.600       0.382 
##       Limit      Rating 
##       0.053       0.794
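
These agree with the Std. Error column of the summary output in part a; one way to pull that column out directly:

# Standard errors as reported in the summary coefficient table
coef(summary(model))[, "Std. Error"]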
  g. Visualize the correlation matrix of regression parameters. What can you say about the correlation between Rating and Limit?

The coefficient estimates for Rating and Limit are almost perfectly negatively correlated: from the matrix above, \(-0.0419/\sqrt{0.00281 \times 0.631} \approx -0.99\). This reflects the strong collinearity between the Limit and Rating predictors themselves.

library(corrplot)

corrplot(cov2cor(MSE * XTXinv))

  h. Find b2 = \((X'X+I)^{-1} X'Y\). (Generate \(I\) as I <- diag(dim(t(X) %*% X)[1]).) Does this agree with your results in part a?

No, it does not agree with the results in part a: the values differ from the coefficients in summary(model). Adding \(I\) to \(X'X\) turns the estimator into ridge regression with penalty \(\lambda = 1\), which shrinks the coefficients, so b2 is close to but not equal to the least squares \(b\).

I <- diag(dim(t(X) %*% X)[1])

b2 <- solve(t(X) %*% X + I) %*% (t(X) %*% Y)
b2
##                 [,1]
## (Intercept) -428.250
## Age           -1.083
## Cards         11.238
## Education      0.413
## Income        -7.505
## Limit          0.143
## Rating         1.779

Q4. This question involves the use of multiple linear regression on the Auto data set.

  a. Produce a scatterplot matrix which includes all of the variables in the data set (hint: you may need to summarize the dataset with summary(). Make sure all the variables are coded as quantitative variables; you may need to change the non-quantitative variables into quantitative ones with as.numeric()).
library(dplyr)
mydata4<- read_xlsx("Auto.xlsx")
head(mydata4)
## # A tibble: 6 x 8
##     mpg cylinders displacement horsepower weight
##   <dbl>     <dbl>        <dbl> <chr>       <dbl>
## 1    18         8          307 130          3504
## 2    15         8          350 165          3693
## 3    18         8          318 150          3436
## 4    16         8          304 150          3433
## 5    17         8          302 140          3449
## 6    15         8          429 198          4341
## # … with 3 more variables: acceleration <dbl>, year <dbl>,
## #   origin <dbl>
# Change horsepower into a quantitative variable (unparseable entries become NA)
mydata4$horsepower <- as.numeric(mydata4$horsepower)

library(GGally)   # for ggpairs()

ggpairs(mydata4)

  b. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables as the predictors. Use the summary() function to print the results. Comment on the output:
model <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + origin + year, data = mydata4, x = T)
summary(model)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + origin + year, data = mydata4, x = T)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.590 -2.157 -0.117  1.869 13.060 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.72e+01   4.64e+00   -3.71  0.00024 ***
## cylinders    -4.93e-01   3.23e-01   -1.53  0.12780    
## displacement  1.99e-02   7.51e-03    2.65  0.00844 ** 
## horsepower   -1.70e-02   1.38e-02   -1.23  0.21963    
## weight       -6.47e-03   6.52e-04   -9.93  < 2e-16 ***
## acceleration  8.06e-02   9.88e-02    0.82  0.41548    
## origin        1.43e+00   2.78e-01    5.13  4.7e-07 ***
## year          7.51e-01   5.10e-02   14.73  < 2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.33 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.821,  Adjusted R-squared:  0.818 
## F-statistic:  252 on 7 and 384 DF,  p-value: <2e-16
  c. How much percent of the variation in the response is reduced by taking into account the predictors?

The Multiple R-squared is 0.821 and the Adjusted R-squared is 0.818, so 82.1% of the variation in the response (mpg) is reduced by taking the predictors into account.
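
This "proportion of variation reduced" is exactly what \(R^2\) measures:

\[ R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} \]

where SST is the total sum of squares of mpg about its mean and SSE is the residual sum of squares of the fitted model.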

  d. For which of the predictors can you reject the null hypothesis \(H_0 :\beta_{j} =0\)?

We can reject the null hypothesis for the predictors whose p-values are below 0.05, listed below:

  1. weight

  2. displacement

  3. origin

  4. year

  e. What does the coefficient for the year variable suggest?

The year variable has a coefficient of 7.51e-01, i.e. about 0.751. The positive sign indicates that, holding the other predictors fixed, predicted mpg increases by about 0.75 for each additional model year.

  f. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which you can reject the null hypothesis \(H_0 :\beta_{j} =0\).
model1 <- lm(mpg ~ displacement + weight + origin + year, data = mydata4, x = T)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + origin + year, data = mydata4, 
##     x = T)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.831 -2.119 -0.008  1.807 13.143 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.88e+01   3.99e+00   -4.72  3.3e-06 ***
## displacement  6.09e-03   4.76e-03    1.28      0.2    
## weight       -6.66e-03   5.56e-04  -11.98  < 2e-16 ***
## origin        1.23e+00   2.65e-01    4.65  4.5e-06 ***
## year          7.76e-01   4.95e-02   15.70  < 2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.35 on 392 degrees of freedom
## Multiple R-squared:  0.819,  Adjusted R-squared:  0.817 
## F-statistic:  443 on 4 and 392 DF,  p-value: <2e-16
  g. Find the adjusted \(R^2\) for the model in part (f). Compare this value to the adjusted \(R^2\) for the model in part (b). Discuss your results.

The adjusted \(R^2\) for the model in part (f) is 0.817, only 0.001 lower than the 0.818 of the full model in part (b). Dropping cylinders, horsepower, and acceleration therefore costs essentially no explanatory power.
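
For reference, adjusted \(R^2\) penalizes each extra parameter,

\[ R^2_{adj} = 1 - \frac{SSE/(n-p)}{SST/(n-1)}, \]

so it barely moves (and can even increase) when predictors that explain little are dropped, which is what happens here.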

  h. Using the model from part (f), obtain 95% confidence intervals for the coefficient(s).

The 95% confidence interval for displacement is (-0.00326, 0.01544), for weight (-0.00775, -0.00556), for origin (0.71286, 1.75622), and for year (0.67922, 0.87368).

confint(object=model1, level=0.95)
##                  2.5 %    97.5 %
## (Intercept)  -26.68915 -10.99282
## displacement  -0.00326   0.01544
## weight        -0.00775  -0.00556
## origin         0.71286   1.75622
## year           0.67922   0.87368

Q5. The weight and systolic blood pressure of 26 randomly selected males in the age group 25–30 are shown in SBP.xlsx.

  a. Estimate and explain the correlation coefficient.

The correlation coefficient is 0.773. The positive sign of r tells us that the relationship is positive: as weight increases, systolic blood pressure tends to increase as well.

mydata5 <- read_xlsx("SBP.xlsx")
head(mydata5)
## # A tibble: 6 x 2
##   weight   sbp
##    <dbl> <dbl>
## 1    165   130
## 2    167   133
## 3    180   150
## 4    155   128
## 5    212   151
## 6    175   146
cor(mydata5$weight, mydata5$sbp)
## [1] 0.773
  b. Test the hypothesis that \(\rho\) = 0.

The p-value (4e-06) is less than the 0.05 significance level, so we reject the null hypothesis. There is sufficient statistical evidence at the 0.05 level to conclude that there is a significant linear relationship between weight and systolic blood pressure.

cor.test(mydata5$weight, mydata5$sbp)
## 
##  Pearson's product-moment correlation
## 
## data:  mydata5$weight and mydata5$sbp
## t = 6, df = 24, p-value = 4e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.551 0.893
## sample estimates:
##   cor 
## 0.773
  c. Find a 95% CI for \(\rho\).

The 95% CI is (0.551, 0.893).

  d. Test the hypothesis that \(\rho\) = 0.6 (at \(\alpha = 0.05\) level).

From part c, the 95% CI for \(\rho\) is (0.551, 0.893). Since 0.6 falls inside this interval, we fail to reject \(H_0: \rho = 0.6\) at the \(\alpha = 0.05\) level.
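
The same conclusion can be reached with an explicit test. cor.test() only tests \(\rho = 0\), so here is a minimal sketch of the standard Fisher z-test for \(H_0: \rho = 0.6\):

# Fisher z-test of H0: rho = 0.6; atanh() is the Fisher z-transform,
# and z is approximately N(0, 1) under H0
r <- cor(mydata5$weight, mydata5$sbp)
n <- nrow(mydata5)
z <- (atanh(r) - atanh(0.6)) * sqrt(n - 3)
2 * pnorm(-abs(z))   # two-sided p-value; comes out above 0.05, consistent with the CI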