Question 13

The data set \(\tt state.x77\) in \(\tt R\) contains the following statistics (among others) related to the 50 states of the United States of America:

  • \(\tt Population\): population estimate (1975)

  • \(\tt Income\): per capita income (1974)

  • \(\tt Illiteracy\): illiteracy (1970, percent of population)

  • \(\tt HS.Grad\): percent high school graduates (1970)

state.data <- data.frame(state.x77)

We are interested in the relation between \(\tt Income\) and the other three variables (\(\tt Population\), \(\tt Illiteracy\), \(\tt HS.Grad\)).

  1. Produce a 4 by 4 scatter plot matrix of the variables above. (2 points)
head(state.data)
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
str(state.data)
## 'data.frame':    50 obs. of  8 variables:
##  $ Population: num  3615 365 2212 2110 21198 ...
##  $ Income    : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life.Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS.Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : num  50708 566432 113417 51945 156361 ...
# select Population, Income, Illiteracy, HS.Grad (columns 1, 2, 3, 6)
a = c(1, 2, 3, 6)
plot(state.data[, a], pch=23, bg='orange')

  2. Fit a multiple linear regression model to the data with \(\tt Income\) as the response variable, and \(\tt Population\), \(\tt Illiteracy\), \(\tt HS.Grad\) as the predictor variables. Comment on the significance of the variables in the model using the result of \(\tt summary()\). (2 points)
income.lm = lm(Income~Population+Illiteracy+HS.Grad,data = state.data )
summary(income.lm)
## 
## Call:
## lm(formula = Income ~ Population + Illiteracy + HS.Grad, data = state.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -987.30 -213.67  -50.68  219.74 1430.99 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1940.74115  710.64526   2.731 0.008924 ** 
## Population     0.03786    0.01500   2.524 0.015129 *  
## Illiteracy   -73.57563  145.07584  -0.507 0.614470    
## HS.Grad       45.57445   10.93778   4.167 0.000135 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 465.8 on 46 degrees of freedom
## Multiple R-squared:  0.4606, Adjusted R-squared:  0.4254 
## F-statistic: 13.09 on 3 and 46 DF,  p-value: 2.612e-06
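From the summary, \(\tt HS.Grad\) is highly significant (\(p \approx 0.0001\)) and \(\tt Population\) is significant at the 5% level (\(p \approx 0.015\)), while \(\tt Illiteracy\) is not significant (\(p \approx 0.61\)) once the other two predictors are in the model.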
  3. Produce standard diagnostic plots of the multiple regression fit in part 2. (2 points)
par(mfrow=c(2, 2))
plot(income.lm)

  4. Plot \(\tt DFFITS\) of the observations and find the observations that have high influence, using the critical value \(0.5\). (2 points)
plot(abs(dffits(income.lm)), pch=23, bg='orange',cex=2,ylab="DFFITS")
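The plot alone does not name the flagged states; a minimal follow-up sketch (using the question's cutoff of \(0.5\)) marks the cutoff on the plot and lists the states exceeding it:

# mark the critical value and list the states whose |DFFITS| exceeds it
abline(h = 0.5, lty = 2)
state.data[which(abs(dffits(income.lm)) > 0.5), ]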

  5. Plot Cook’s distance of the observations and find observations that have high influence, using the critical value \(0.1\). Compare these results with those of part 4. (2 points)
plot(cooks.distance(income.lm), pch=23, bg='orange', cex=2, ylab="Cook's distance")

state.data[which(cooks.distance(income.lm) > 0.1),]
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## New Mexico       1144   3601        2.2    70.32    9.7    55.2   120 121412
## Utah             1203   4022        0.6    72.90    4.5    67.3   137  82096
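Since Cook’s distance satisfies \(D_i \approx \text{DFFITS}_i^2/(p+1)\), the two diagnostics tend to flag similar states; with \(p+1=4\), the DFFITS cutoff of \(0.5\) corresponds to \(D_i \approx 0.0625\), so it is somewhat more liberal than the \(0.1\) cutoff used here.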
  6. Find states with outlying predictors by looking at the leverage values. Use critical value \(0.3\). (2 points)
state.data[which(hatvalues(income.lm) > 0.3), ] 
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
  7. Find outliers, if any, in the response. Remove them from the data, refit a multiple linear regression model, and compare the result with the previous fit in part 2. (2 points)
n = nrow(state.data)
# Bonferroni cutoff for externally studentized residuals:
# alpha = 0.05 split across 2n tails, df = n - p - 2 = n - 5 (p = 3 predictors)
cutoff = qt(0.05 / (2 * n), n - 5, lower.tail = FALSE)
state.data[which(abs(rstudent(income.lm)) > cutoff), ]
##        Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432
# Alaska (row 2) is the flagged outlier; refit without it
income.lm1 = lm(Income ~ Population + Illiteracy + HS.Grad, data = state.data[-2, ])
summary(income.lm1)
## 
## Call:
## lm(formula = Income ~ Population + Illiteracy + HS.Grad, data = state.data[-2, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -806.64 -223.85  -94.83  228.49  895.23 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2922.00458  669.84639   4.362 7.42e-05 ***
## Population     0.04463    0.01322   3.375  0.00153 ** 
## Illiteracy  -248.72103  134.46160  -1.850  0.07092 .  
## HS.Grad       29.75007   10.38054   2.866  0.00630 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407 on 45 degrees of freedom
## Multiple R-squared:  0.4997, Adjusted R-squared:  0.4663 
## F-statistic: 14.98 on 3 and 45 DF,  p-value: 6.723e-07
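Compared with the fit in part 2, dropping Alaska reduces the residual standard error from 465.8 to 407 and raises the adjusted \(R^2\) from 0.425 to 0.466; the \(\tt HS.Grad\) coefficient shrinks from about 45.6 to 29.8, while the \(\tt Illiteracy\) coefficient becomes more strongly negative and borderline significant.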
  8. As a summary, find all the influential states using the \(\tt influence.measures\) function; a sketch follows below. (2 points)
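One possible answer, sketched with the model from part 2 (the exact set of flagged states depends on the thresholds \(\tt influence.measures\) applies internally):

inf = influence.measures(income.lm)
# summary() prints only the observations flagged as potentially
# influential by at least one measure (DFBETAS, DFFITS, covratio,
# Cook's distance, or leverage)
summary(inf)
# names of the flagged states
rownames(state.data)[apply(inf$is.inf, 1, any)]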

Question 14

Consider the Scottish hills races data.

url = 'http://www.statsci.org/data/general/hills.txt'
races.table = read.table(url, header=TRUE, sep='\t')

Choose an observation index \(i\) (e.g. \(i = 33\), which corresponds to the outlying observation number 33) and create a dummy variable \(U\) whose values are all zero except for the \(i\)-th (33rd), which is one.

U <- rep(0, nrow(races.table))
U[33] <- 1

Now consider comparing the following models:

\[ H_0: \text{Time} = \beta_0 + \beta_1 \text{Distance} + \beta_2 \text{Climb} + \varepsilon , \quad \text{say, } \quad \text{Model 1} \]

\[ H_a: \text{Time} = \beta_0 + \beta_1 \text{Distance} + \beta_2 \text{Climb} + \beta_3 U + \varepsilon , \quad \text{say, } \quad \text{Model 2} \]

Let \(t_i^* = \dfrac{Y_i - \hat{Y}_i}{\hat{\sigma}_{(i)}\sqrt{1 - H_{ii}}}\) be the \(i\)-th externally standardized residual (studentized residual), where \(\hat{\sigma}_{(i)}^2\) is the mean square error of the fit without the \(i\)-th observation and \(H_{ii}\) is the \(i\)-th leverage value (that is, the \(i\)-th diagonal element of \(\boldsymbol{X}\left( \boldsymbol{X}^{\rm T} \boldsymbol{X} \right)^{-1} \boldsymbol{X}^{\rm T}\)). Verify using the 33rd observation that

  1. The t-test statistic value for testing \(\beta_3=0\) in Model 2 is the same as the \(i\)-th (33rd) externally standardized residual (studentized residual) obtained from Model 1. (2 points)
races.table$U = U
races.lm = lm(Time ~ Distance + Climb, data=races.table)
races.lm2 = lm(Time ~ Distance + Climb + U, data=races.table)
summary(races.lm2)
## 
## Call:
## lm(formula = Time ~ Distance + Climb + U, data = races.table)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.898  -7.009  -1.321   1.764  64.724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.306261   4.433809  -1.873   0.0705 .  
## Distance     6.159210   0.610807  10.084 2.63e-11 ***
## Climb        0.010726   0.002112   5.079 1.71e-05 ***
## U           11.913956  16.241521   0.734   0.4687    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.78 on 31 degrees of freedom
## Multiple R-squared:  0.9204, Adjusted R-squared:  0.9127 
## F-statistic: 119.5 on 3 and 31 DF,  p-value: < 2.2e-16
rstudent(races.lm)[33]
##        33 
## 0.7335493
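The \(t\) value for \(U\) in the Model 2 summary (\(0.734\)) indeed equals the 33rd externally studentized residual from Model 1.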
  2. The F-test statistic value for testing Model 1 versus Model 2 reduces to the square of the \(i\)-th (33rd) externally standardized residual (studentized residual). (2 points)
anova(races.lm,races.lm2)
## Analysis of Variance Table
## 
## Model 1: Time ~ Distance + Climb
## Model 2: Time ~ Distance + Climb + U
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     32 6891.9                           
## 2     31 6774.3  1    117.59 0.5381 0.4687
rstudent(races.lm)[33]^2
##        33 
## 0.5380945
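As claimed, the \(F\) statistic from the ANOVA table (\(0.5381\)) matches \(0.7335^2\), the squared studentized residual.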
  3. Fit Model 1 to the Scottish hills races data without the \(i\)-th (33rd) observation. (2 points)
races.lm3 = lm(Time ~ Distance + Climb, data=races.table[-33,])
races.lm3
## 
## Call:
## lm(formula = Time ~ Distance + Climb, data = races.table[-33, 
##     ])
## 
## Coefficients:
## (Intercept)     Distance        Climb  
##    -8.30626      6.15921      0.01073
  4. Show that the estimates of \(\beta_0\), \(\beta_1\), \(\beta_2\) in Model 2 are the same as those obtained in part 3. [Hence adding an indicator variable for the \(i\)-th observation is equivalent to deleting the corresponding observation!] (2 points)
races.lm2$coefficients
## (Intercept)    Distance       Climb           U 
## -8.30626092  6.15921043  0.01072625 11.91395619
races.lm3$coefficients
## (Intercept)    Distance       Climb 
## -8.30626092  6.15921043  0.01072625
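The intercept, \(\tt Distance\), and \(\tt Climb\) estimates agree to all printed digits, confirming the equivalence.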

Question 15

This question will review some of the fundamental concepts of the multiple linear regression model.

  1. Define the multiple linear regression model. [Use \(Y\) as response and \(X_1,~ X_2,~\ldots,~X_p\) as predictors and \(n\) observations.] (2 points)
  • \(Y_i=\beta_0+\beta_1 X_{i1}+\cdots+\beta_p X_{ip}+\epsilon_i, \qquad i=1,\ldots,n\)
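Equivalently, stacking the \(n\) observations (a standard matrix restatement; \(\boldsymbol{X}\) is the \(n \times (p+1)\) design matrix whose first column is all ones):

\[ \boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \]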
  2. Write down the assumptions of the multiple linear regression model. (2 points)
  • The errors are independent and identically distributed, \(\epsilon_i \sim N(0,\sigma^{2})\), with mean zero and constant variance.

  • The observations form an independent sample.

  3. What is the regression function in the multiple linear regression model? (2 points)
  • \(E(Y \mid X_1,\ldots,X_p)=\beta_0+\beta_1 X_1+\cdots+\beta_p X_p\); the regression function is the mean of \(Y\) given the predictors, so it does not include the error term.
  4. What about the regression function makes this model a linear model? (2 points)
  • The regression function is linear in the parameters \(\beta_0,\beta_1,\beta_2,\ldots,\beta_p\), which are constants referred to as the partial regression coefficients (or simply the regression coefficients); \(\epsilon\) is a random disturbance or error. It is assumed that, for any set of fixed values of \(X_1,X_2,\ldots,X_p\) that fall within the range of the data, the linear equation provides an acceptable approximation of the true relationship between \(Y\) and the \(X\)'s (\(Y\) is approximately a linear function of the \(X\)'s, and \(\epsilon\) measures the discrepancy in that approximation).
  5. What function might you minimize to estimate parameters in your multiple linear regression model? [No need to give too detailed an algorithm.] (2 points)
  • The error sum of squares \(SSE(\beta_0,\beta_1,\beta_2,\ldots,\beta_p)=\sum_{i=1}^{n}\Big[Y_i-\Big(\beta_0+\sum_{j=1}^{p}\beta_j X_{ij}\Big)\Big]^{2}\); see the sketch below.
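Minimizing \(SSE\) yields the closed-form least squares solution \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{\rm T}\boldsymbol{X})^{-1}\boldsymbol{X}^{\rm T}\boldsymbol{Y}\). A minimal sketch, reusing \(\tt state.data\) and \(\tt income.lm\) from Question 13, checks this against \(\tt lm()\):

# least squares by hand: solve the normal equations X'X b = X'y
X = cbind(1, as.matrix(state.data[, c("Population", "Illiteracy", "HS.Grad")]))
y = state.data$Income
beta.hat = solve(t(X) %*% X, t(X) %*% y)
# compare with lm()'s estimates
cbind(beta.hat, coef(income.lm))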
  6. Give an example of a regression function you might call nonlinear. (2 points)
  • \(Y=\alpha X^{\beta}\), which is nonlinear in the parameter \(\beta\).