Question 1

Read the following R code and outputs regarding a linear regression model for the “gala” data from the Faraway book and answer the questions. This data set concerns species diversity on the Galapagos Islands. There are 30 Galapagos islands and 7 variables in the dataset. The relationship between the number of plant species and several geographic variables is of interest. The original dataset contained several missing values which have been filled for convenience.

library(faraway)

## Warning: package 'faraway' was built under R version 4.5.2

data(gala,package="faraway")
#####First, fit the full model##
fullmod <- lm(Species~Area+Elevation+Nearest+Scruz+Adjacent,data=gala)
summary(fullmod)

## 
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
##     data = gala)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -111.679  -34.898   -7.862   33.460  182.584 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.068221  19.154198   0.369 0.715351    
## Area        -0.023938   0.022422  -1.068 0.296318    
## Elevation    0.319465   0.053663   5.953 3.82e-06 ***
## Nearest      0.009144   1.054136   0.009 0.993151    
## Scruz       -0.240524   0.215402  -1.117 0.275208    
## Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared:  0.7658, Adjusted R-squared:  0.7171 
## F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07

##### Then, fit the reduced model#####
redmod <- lm(Species~Elevation+Nearest+Scruz+I(Area+Adjacent),data=gala)
######Compare two models#####
anova(redmod,fullmod)

# part 2
confint(fullmod, level = 0.9)

##                      5 %        95 %
## (Intercept) -25.70235310 39.83879452
## Area         -0.06230034  0.01442366
## Elevation     0.22765403  0.41127549
## Nearest      -1.79435834  1.81264627
## Scruz        -0.60905208  0.12800362
## Adjacent     -0.10508777 -0.04452190

What is the total number of observations in this data set? What is the degrees of freedom associated with the SSE for the fullmod above?

Answer: There are 30 Observations. 24 DF for the SSE for the fullmod.

Find the 90% confidence interval for \(\beta_{\text{Adjacent}}\). You can either use numbers already reported above or use a built-in function to get the confidence interval. Please address the question directly by typing down the answer. I will not be able to locate the answer from a long R output. You must tell me what your answer is. Same comments for all the questions below.

Answer: [-0.10508777, -0.04452190]

Based on the 90% confidence interval above, can you reject the null hypothesis \(\beta_{\text{Adjacent}}=0\) vs. the alternative hypothesis \(\beta_{\text{Adjacent}}\neq 0\)? Why?

Answer: Yes. We can reject the null hypothesis that Badjacent = 0 because 0 is not contained within the 90% confidence interval.

What is the interpretation of the estimated coefficient for Elevation (i.e. \(\widehat\beta_{\text{Elevation}}\))?

Answer: For every increase in elevation by one unit, the number of plant species will increase by 0.319465 on average.

Some student claims that since the estimated coefficient for Elevation is 0.319465, it means whenever Elevation is increased by 1 unit, the mean value of Species will increase by 0.319465. Another student believes this is wrong since if you regress Species on Elevation only (with no other predictors) then the coefficient is 0.2007922 (see below.) Please elaborate on who is right and who is wrong here.

lm(Species~Elevation,data=gala)$coeff

## (Intercept)   Elevation 
##  11.3351132   0.2007922

Answer: Neither student is necessarily wrong here, they are just referring to different models. They are both correct within context of their respective models.

Write down the null hypothesis \(H_0\) and the alternative hypothesis \(H_a\) for the ANOVA F-test given in the above R output that compares the full model and the reduced model.

Answer: Null hypothesis: The reduced model is the best fit for the data. Alternative hypothesis: The full model is the best fit for the data.

What is the conclusion of the above ANOVA F-test at the 5% level? Should we use the full model or the reduced model?

Answer: Since Pr(>F) < 0.05, the null hypothesis is rejected. The full model should be used.

The same set of null and alternative hypotheses can also be tested using a t-test, by working with \(\beta_{\text{Area}} - \beta_{\text{Adjacent}}\), as well as its estimate \(\widehat\beta_{\text{Area}} - \widehat\beta_{\text{Adjacent}}\). Briefly write down the plan as to how to conduct this test. You do not have to actually complete this test, but you need to provide the plan on how to execute it.

Answer: Fit the model lm(Species~Elevation+Nearest+Scruz+I(Area-Adjacent),data=gala). Find the summary of the model. Find the p-value of I(Area-Adjacent) from the summary. If the p-value is less than the alpha level, then it is considered significant and a reduced model would be better. If the p-value is greater than the alpha level then it is not considered significant, and one should opt for the full model.

It is known that the variance-covariance matrix for the vector of \(\widehat\beta_j\)’s is estimated as \(\hat\sigma^2(\mX^T\mX)^{-1}\). It may be computed in either of the following two ways.

vcov(fullmod) # one way to get variance-covriance matrix

##              (Intercept)          Area     Elevation      Nearest         Scruz
## (Intercept) 366.88329428  0.1404740421 -0.5807385312 -0.869644244 -1.3980671735
## Area          0.14047404  0.0005027618 -0.0009642999  0.004811068 -0.0001826696
## Elevation    -0.58073853 -0.0009642999  0.0028796966 -0.013196449  0.0011454447
## Nearest      -0.86964424  0.0048110685 -0.0131964495  1.111202600 -0.1420666472
## Scruz        -1.39806717 -0.0001826696  0.0011454447 -0.142066647  0.0463981286
## Adjacent      0.08587895  0.0001717816 -0.0006098372  0.005297104 -0.0007281114
##                  Adjacent
## (Intercept)  0.0858789494
## Area         0.0001717816
## Elevation   -0.0006098372
## Nearest      0.0052971041
## Scruz       -0.0007281114
## Adjacent     0.0003132967

X = model.matrix(fullmod) # another way
sigma2 = (summary(fullmod)$sigma)^2
sigma2 * solve(t(X)%*%X)

##              (Intercept)          Area     Elevation      Nearest         Scruz
## (Intercept) 366.88329428  0.1404740421 -0.5807385312 -0.869644244 -1.3980671735
## Area          0.14047404  0.0005027618 -0.0009642999  0.004811068 -0.0001826696
## Elevation    -0.58073853 -0.0009642999  0.0028796966 -0.013196449  0.0011454447
## Nearest      -0.86964424  0.0048110685 -0.0131964495  1.111202600 -0.1420666472
## Scruz        -1.39806717 -0.0001826696  0.0011454447 -0.142066647  0.0463981286
## Adjacent      0.08587895  0.0001717816 -0.0006098372  0.005297104 -0.0007281114
##                  Adjacent
## (Intercept)  0.0858789494
## Area         0.0001717816
## Elevation   -0.0006098372
## Nearest      0.0052971041
## Scruz       -0.0007281114
## Adjacent     0.0003132967

# var(area) + var(adjacent) + 2cov(area + adjacent)
0.0005027618 + 0.0003132967 + 2*0.0001717816

## [1] 0.001159622

Given such information, compute the variance of \(\widehat\beta_{\text{Area}} - \widehat\beta_{\text{Adjacent}}\).

Answer: 0.001159622

What is the R-square value for the full model?

Answer: 0.7658

How would you interpret the R-square value?

% of the variance seen in the predicted number of species can be accounted for by the predictors.

Question 2

An analyst for the auto industry has asked for your help in modeling data on the prices of new cars. Interest centers on modeling the effects of different aspects of a car on its suggested retail price. Data are available for all 234 cars on the following variables: Y = Suggested Retail Price; x1 = Engine size; x2 = Cylinders; x3 = Horse power; x4 = Highway mpg; x5 = Weight; x6 = Wheel Base; x7 = Hybrid, a dummy variable which is 1 for so-called hybrid cars.

url = "http://people.math.binghamton.edu/qiao/data501/data/cars04.csv"
car <- read.csv(url,header=T)
car_models <- car[,1]
car <- car[,c("SuggestedRetailPrice","EngineSize","Cylinders","Horsepower",
              "HighwayMPG","Weight","WheelBase","Hybrid")]

summary(car)

##  SuggestedRetailPrice   EngineSize      Cylinders        Horsepower   
##  Min.   : 10280       Min.   :1.400   Min.   : 3.000   Min.   : 73.0  
##  1st Qu.: 19161       1st Qu.:2.200   1st Qu.: 4.000   1st Qu.:150.0  
##  Median : 26008       Median :2.850   Median : 6.000   Median :200.0  
##  Mean   : 29757       Mean   :2.899   Mean   : 5.517   Mean   :199.8  
##  3rd Qu.: 36831       3rd Qu.:3.500   3rd Qu.: 6.000   3rd Qu.:232.0  
##  Max.   :128420       Max.   :5.500   Max.   :12.000   Max.   :493.0  
##    HighwayMPG       Weight       WheelBase         Hybrid       
##  Min.   :19.0   Min.   :1850   Min.   : 93.0   Min.   :0.00000  
##  1st Qu.:26.0   1st Qu.:2937   1st Qu.:104.0   1st Qu.:0.00000  
##  Median :28.0   Median :3388   Median :107.0   Median :0.00000  
##  Mean   :29.4   Mean   :3313   Mean   :107.1   Mean   :0.01282  
##  3rd Qu.:31.0   3rd Qu.:3650   3rd Qu.:111.0   3rd Qu.:0.00000  
##  Max.   :66.0   Max.   :4474   Max.   :124.0   Max.   :1.00000

lmod1 = lm(SuggestedRetailPrice ~ EngineSize + Cylinders + 
             Horsepower + HighwayMPG + Weight + WheelBase + Hybrid, data=car)

summary(lmod1)

## 
## Call:
## lm(formula = SuggestedRetailPrice ~ EngineSize + Cylinders + 
##     Horsepower + HighwayMPG + Weight + WheelBase + Hybrid, data = car)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17436  -4134    173   3561  46392 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -68965.793  16180.381  -4.262 2.97e-05 ***
## EngineSize   -6957.457   1600.137  -4.348 2.08e-05 ***
## Cylinders     3564.755    969.633   3.676 0.000296 ***
## Horsepower     179.702     16.411  10.950  < 2e-16 ***
## HighwayMPG     637.939    202.724   3.147 0.001873 ** 
## Weight          11.911      2.658   4.481 1.18e-05 ***
## WheelBase       47.607    178.070   0.267 0.789444    
## Hybrid         431.759   6092.087   0.071 0.943562    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7533 on 226 degrees of freedom
## Multiple R-squared:  0.7819, Adjusted R-squared:  0.7751 
## F-statistic: 115.7 on 7 and 226 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(lmod1)

plot(residuals(lmod1))

Decide whether Model I (lmod1) is a good model for the data set. Give reasons to support your answer.

Answer: While the model has a high r^2 value and a significant p-value, I don’t believe this model is a a good fit because it fails several regression assumptions. The residuals vs fitted plot shows a fanning pattern, which suggests an unequal variance among the errors. Additionally, the QQplot strays from a straight line which suggests that the errors are not normal.

The plot of residuals against fitted values produces a curved pattern. Describe what, if anything, can be learned about Model I from the diagnostic plots along this direction. For this question, do not use anything beyond a visual check of the residual plot.

Answer: The residuals vs fitted plot having a curved patterns suggests that the errors of model I have an unequal variance.

What else regarding the error term of the regression model can we learn from the residual plot of this model? If there is anything wrong with the residual plot, what can you do to fix this issue? (Explain.) For this question, do not use anything beyond a visual check of the residual plot.

Answer: The residual plot seems to follow a fanning pattern along with having some under predicted outliers. To fix this issue, a transformation such as taking the log of the price, or adding a quadratic term may help the model.

Do we need to worry about any outliers/influential points from the first, third, and fourth diagnostic plots?

Answer: According to the Residuals vs leverage plot, there are a few points with high leverage, but only one point, 223, that falls outside of cook’s distance and is considered influential.

Should we use this model to predict the suggested retail price of a car with two cylinders? (Explain)

Answer: No. This model accounts not only for cylinders, but several other predictor variables. If we didn’t know any other information about the car, it should not be used for this model. Additionally, since this model fails several assumptions, it should not be used even if we had all the information about the car.

What happens to cars #222, #223, #229? Can you explain why they have such large residuals? What should you do about these three cars?

Answer: Cars 222, 223, and 229 are the three most expensive cars according to suggested retail price, and are significantly higher priced than the other cars in the dataset. This leads to the model under-predicting on these cars because it is based primarily on other cheaper cars. If possible these cars should be looked into to determine if there is an external factor leading to their high price such as car brand.

What happens to cars #67? Why it has such a large leverage? You may Google this car and find some explanations.

Car #67 is at the extreme end of several predictors. It is one of three hybrid cars in the dataset, it is the only car with 3 cylinders which is the lowest amount for any car, it has the highest highway MPG, the lowest weight and the lowest horsepower. In many ways, it is not similar to the other cars in the dataset which gives it a high leverage score.

Refit Model 1 with #222, #223, #229 and #67 removed, and redraw the diagnostic plots. Call the lm object lmod1_removed.

car_removed <- car[c(-67,-222,-223,-229),]
lmod1_removed = lm(SuggestedRetailPrice ~ EngineSize + Cylinders + 
             Horsepower + HighwayMPG + Weight + WheelBase + Hybrid, data=car_removed)

summary(lmod1_removed)

## 
## Call:
## lm(formula = SuggestedRetailPrice ~ EngineSize + Cylinders + 
##     Horsepower + HighwayMPG + Weight + WheelBase + Hybrid, data = car_removed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14056.6  -3080.5    -92.3   3034.4  25015.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -72284.560  12627.077  -5.725 3.34e-08 ***
## EngineSize   -8443.214   1313.408  -6.428 7.78e-10 ***
## Cylinders     3160.193    789.168   4.004 8.48e-05 ***
## Horsepower     154.619     13.155  11.753  < 2e-16 ***
## HighwayMPG     409.216    161.388   2.536   0.0119 *  
## Weight          12.391      2.061   6.013 7.40e-09 ***
## WheelBase      228.739    140.926   1.623   0.1060    
## Hybrid       -3087.508   5058.179  -0.610   0.5422    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5829 on 222 degrees of freedom
## Multiple R-squared:  0.8171, Adjusted R-squared:  0.8114 
## F-statistic: 141.7 on 7 and 222 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(lmod1_removed)

Question 3

The data for this example come from a study of the effects of childhood sexual abuse on adult females reported in Rodriguez et al. (1997): 45 women treated at a clinic, who reported childhood sexual abuse (csa), were measured for post-traumatic stress disorder (ptsd) and childhood physical abuse (cpa) both on standardized scales. Thirty-one women treated at the same clinic, who did not report childhood sexual abuse, were also measured.

Scatterplots between ptsd and cpa are shown below. The points are the same between the left and the right panels. On the left panel, a linear regression model with ptsd on cpa is fitted to the entire data. On the right panel, the same model is fitted to the two subgroups csa=1 and csa=0 separately.

data(sexab,package="faraway")
library(ggplot2)
p1 = ggplot(data=sexab, aes(x = cpa, y = ptsd)) + 
  geom_point(aes(shape = csa)) + geom_smooth(method='lm') + 
  theme(legend.position="bottom")
p2 = ggplot(data=sexab, aes(x = cpa, y = ptsd)) + 
  geom_point(aes(shape = csa, color = csa)) + geom_smooth(method='lm', aes(group = csa, color = csa)) + 
  theme(legend.position="bottom")

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.5.2

grid.arrange(p1, p2, nrow = 1)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

It seems that the data points in the left panel more or less follow the regression line. Note that the csa information is not used in this regression model. Explain, intuitively, why one may still want to use the csa information if the one on the left hand side seems to be adequate?

Answer: Even if the plot on the left seems to follow the regression line, there may be a better fit when differentiating by csa. Intuitively, it makes sense that someone is more likely to experience PTSD if they faced two types of abuse as opposed to one.

Based on the right panel plot, do you think the true slope for cpa should be different between the two subgroups? Why or why not? You do not have to do rigorous test here. You simply need to speak about your hunch.

Answer: No I don’t think it should be different. While the lines for each subgroup do have a different slope, I don’t think it’s a large enough difference to claim that the true slope is different between subgroups, especially with only 76 observations.

Based on your hunch in the last question, fit a model which incorporates the variable csa. Whether you will include the interaction term or not must be consistent with your answer in the last question. Report the summary report.

ptsd_model <- lm(ptsd ~ cpa + csa, data = sexab)
summary(ptsd_model)

## 
## Call:
## lm(formula = ptsd ~ cpa + csa, data = sexab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1567 -2.3643 -0.1533  2.1466  7.1417 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.2480     0.7187  14.260  < 2e-16 ***
## cpa            0.5506     0.1716   3.209  0.00198 ** 
## csaNotAbused  -6.2728     0.8219  -7.632 6.91e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.273 on 73 degrees of freedom
## Multiple R-squared:  0.5786, Adjusted R-squared:  0.5671 
## F-statistic: 50.12 on 2 and 73 DF,  p-value: 2.002e-14

Now formally test where there should be an interaction using ANOVA F-test. Fit both a model with interaction and a model without interaction, then compare them using anova, and report your conclusion.

ptsd_int_model <- lm(ptsd ~ cpa + csa + cpa:csa , data = sexab)
anova(ptsd_model, ptsd_int_model)

Conclusion: Fail to reject the null that the reduced model (without an interaction term) is a better fit since Pr(>F) > 0.05.

Report the coefficient summary for the model with interaction. From the t-value and the associated p-value for the term cpa:csaNotAbused, what conclusion can you make for this t-test? What interesting observation can you make about the p-value here?

summary(ptsd_int_model)

## 
## Call:
## lm(formula = ptsd ~ cpa + csa + cpa:csa, data = sexab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1999 -2.5313 -0.1807  2.7744  6.9748 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       10.5571     0.8063  13.094  < 2e-16 ***
## cpa                0.4500     0.2085   2.159   0.0342 *  
## csaNotAbused      -6.8612     1.0747  -6.384 1.48e-08 ***
## cpa:csaNotAbused   0.3140     0.3685   0.852   0.3970    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.279 on 72 degrees of freedom
## Multiple R-squared:  0.5828, Adjusted R-squared:  0.5654 
## F-statistic: 33.53 on 3 and 72 DF,  p-value: 1.133e-13

Answer: Since Pr(>|t|) > 0.05, we can conclude that the interaction term is not significant to the model. The p-value is the same as the p-value from the anova model that was conducted previously.

Draw the four diagnostic plots below for the model without interaction. Are you concerned about anything for this model? Explain.

plot(ptsd_model)

Answer: The QQ-plot mostly follows a straight line, and does not suggest a non-normal distribution from the errors. The residuals vs leverage plot does not suggest any outliers/influential points. Both the Scale-location plot and the residuals vs fitted plot show two separate clusters of points, which may suggest unequal variance. However, since we know that the data is split based on csa, which can be seen in the initial scatter plot, I don’t think it is an issue. Additionally, the points within each group appear to be random. Ultimately, I don’t see anything wrong with these diagnostic plots.

Data 501: Midterm Exam

Exam date = 11/13/2025

Question 1

Question 2

Question 3