Read the following R code and outputs regarding a linear regression
model for the “gala” data from the Faraway book and answer
the questions. This data set concerns species diversity on the Galapagos
Islands. There are 30 Galapagos islands and 7 variables in the dataset.
The relationship between the number of plant species and several
geographic variables is of interest. The original dataset contained
several missing values which have been filled for convenience.
library(faraway)
## Warning: package 'faraway' was built under R version 4.5.2
data(gala,package="faraway")
#####First, fit the full model##
fullmod <- lm(Species~Area+Elevation+Nearest+Scruz+Adjacent,data=gala)
summary(fullmod)
##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
##### Then, fit the reduced model#####
redmod <- lm(Species~Elevation+Nearest+Scruz+I(Area+Adjacent),data=gala)
######Compare two models#####
anova(redmod,fullmod)
# part 2
confint(fullmod, level = 0.9)
## 5 % 95 %
## (Intercept) -25.70235310 39.83879452
## Area -0.06230034 0.01442366
## Elevation 0.22765403 0.41127549
## Nearest -1.79435834 1.81264627
## Scruz -0.60905208 0.12800362
## Adjacent -0.10508777 -0.04452190
fullmod above?Answer: There are 30 Observations. 24 DF for the SSE for the fullmod.
Answer: [-0.10508777, -0.04452190]
Answer: Yes. We can reject the null hypothesis that Badjacent = 0 because 0 is not contained within the 90% confidence interval.
Elevation (i.e. \(\widehat\beta_{\text{Elevation}}\))?Answer: For every increase in elevation by one unit, the number of plant species will increase by 0.319465 on average.
Elevation is 0.319465, it means whenever
Elevation is increased by 1 unit, the mean value of
Species will increase by 0.319465. Another student believes
this is wrong since if you regress Species on
Elevation only (with no other predictors) then the
coefficient is 0.2007922 (see below.) Please elaborate on who is right
and who is wrong here.lm(Species~Elevation,data=gala)$coeff
## (Intercept) Elevation
## 11.3351132 0.2007922
Answer: Neither student is necessarily wrong here, they are just referring to different models. They are both correct within context of their respective models.
Answer: Null hypothesis: The reduced model is the best fit for the data. Alternative hypothesis: The full model is the best fit for the data.
Answer: Since Pr(>F) < 0.05, the null hypothesis is rejected. The full model should be used.
Answer: Fit the model
lm(Species~Elevation+Nearest+Scruz+I(Area-Adjacent),data=gala). Find the
summary of the model. Find the p-value of I(Area-Adjacent) from the
summary. If the p-value is less than the alpha level, then it is
considered significant and a reduced model would be better. If the
p-value is greater than the alpha level then it is not considered
significant, and one should opt for the full model.
vcov(fullmod) # one way to get variance-covriance matrix
## (Intercept) Area Elevation Nearest Scruz
## (Intercept) 366.88329428 0.1404740421 -0.5807385312 -0.869644244 -1.3980671735
## Area 0.14047404 0.0005027618 -0.0009642999 0.004811068 -0.0001826696
## Elevation -0.58073853 -0.0009642999 0.0028796966 -0.013196449 0.0011454447
## Nearest -0.86964424 0.0048110685 -0.0131964495 1.111202600 -0.1420666472
## Scruz -1.39806717 -0.0001826696 0.0011454447 -0.142066647 0.0463981286
## Adjacent 0.08587895 0.0001717816 -0.0006098372 0.005297104 -0.0007281114
## Adjacent
## (Intercept) 0.0858789494
## Area 0.0001717816
## Elevation -0.0006098372
## Nearest 0.0052971041
## Scruz -0.0007281114
## Adjacent 0.0003132967
X = model.matrix(fullmod) # another way
sigma2 = (summary(fullmod)$sigma)^2
sigma2 * solve(t(X)%*%X)
## (Intercept) Area Elevation Nearest Scruz
## (Intercept) 366.88329428 0.1404740421 -0.5807385312 -0.869644244 -1.3980671735
## Area 0.14047404 0.0005027618 -0.0009642999 0.004811068 -0.0001826696
## Elevation -0.58073853 -0.0009642999 0.0028796966 -0.013196449 0.0011454447
## Nearest -0.86964424 0.0048110685 -0.0131964495 1.111202600 -0.1420666472
## Scruz -1.39806717 -0.0001826696 0.0011454447 -0.142066647 0.0463981286
## Adjacent 0.08587895 0.0001717816 -0.0006098372 0.005297104 -0.0007281114
## Adjacent
## (Intercept) 0.0858789494
## Area 0.0001717816
## Elevation -0.0006098372
## Nearest 0.0052971041
## Scruz -0.0007281114
## Adjacent 0.0003132967
# var(area) + var(adjacent) + 2cov(area + adjacent)
0.0005027618 + 0.0003132967 + 2*0.0001717816
## [1] 0.001159622
Given such information, compute the variance of \(\widehat\beta_{\text{Area}} - \widehat\beta_{\text{Adjacent}}\).
Answer: 0.001159622
Answer: 0.7658
% of the variance seen in the predicted number of species can be accounted for by the predictors.
An analyst for the auto industry has asked for your help in modeling data on the prices of new cars. Interest centers on modeling the effects of different aspects of a car on its suggested retail price. Data are available for all 234 cars on the following variables: Y = Suggested Retail Price; x1 = Engine size; x2 = Cylinders; x3 = Horse power; x4 = Highway mpg; x5 = Weight; x6 = Wheel Base; x7 = Hybrid, a dummy variable which is 1 for so-called hybrid cars.
url = "http://people.math.binghamton.edu/qiao/data501/data/cars04.csv"
car <- read.csv(url,header=T)
car_models <- car[,1]
car <- car[,c("SuggestedRetailPrice","EngineSize","Cylinders","Horsepower",
"HighwayMPG","Weight","WheelBase","Hybrid")]
summary(car)
## SuggestedRetailPrice EngineSize Cylinders Horsepower
## Min. : 10280 Min. :1.400 Min. : 3.000 Min. : 73.0
## 1st Qu.: 19161 1st Qu.:2.200 1st Qu.: 4.000 1st Qu.:150.0
## Median : 26008 Median :2.850 Median : 6.000 Median :200.0
## Mean : 29757 Mean :2.899 Mean : 5.517 Mean :199.8
## 3rd Qu.: 36831 3rd Qu.:3.500 3rd Qu.: 6.000 3rd Qu.:232.0
## Max. :128420 Max. :5.500 Max. :12.000 Max. :493.0
## HighwayMPG Weight WheelBase Hybrid
## Min. :19.0 Min. :1850 Min. : 93.0 Min. :0.00000
## 1st Qu.:26.0 1st Qu.:2937 1st Qu.:104.0 1st Qu.:0.00000
## Median :28.0 Median :3388 Median :107.0 Median :0.00000
## Mean :29.4 Mean :3313 Mean :107.1 Mean :0.01282
## 3rd Qu.:31.0 3rd Qu.:3650 3rd Qu.:111.0 3rd Qu.:0.00000
## Max. :66.0 Max. :4474 Max. :124.0 Max. :1.00000
lmod1 = lm(SuggestedRetailPrice ~ EngineSize + Cylinders +
Horsepower + HighwayMPG + Weight + WheelBase + Hybrid, data=car)
summary(lmod1)
##
## Call:
## lm(formula = SuggestedRetailPrice ~ EngineSize + Cylinders +
## Horsepower + HighwayMPG + Weight + WheelBase + Hybrid, data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17436 -4134 173 3561 46392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -68965.793 16180.381 -4.262 2.97e-05 ***
## EngineSize -6957.457 1600.137 -4.348 2.08e-05 ***
## Cylinders 3564.755 969.633 3.676 0.000296 ***
## Horsepower 179.702 16.411 10.950 < 2e-16 ***
## HighwayMPG 637.939 202.724 3.147 0.001873 **
## Weight 11.911 2.658 4.481 1.18e-05 ***
## WheelBase 47.607 178.070 0.267 0.789444
## Hybrid 431.759 6092.087 0.071 0.943562
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7533 on 226 degrees of freedom
## Multiple R-squared: 0.7819, Adjusted R-squared: 0.7751
## F-statistic: 115.7 on 7 and 226 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod1)
plot(residuals(lmod1))
lmod1) is a good model for the
data set. Give reasons to support your answer.Answer: While the model has a high r^2 value and a significant p-value, I don’t believe this model is a a good fit because it fails several regression assumptions. The residuals vs fitted plot shows a fanning pattern, which suggests an unequal variance among the errors. Additionally, the QQplot strays from a straight line which suggests that the errors are not normal.
Answer: The residuals vs fitted plot having a curved patterns suggests that the errors of model I have an unequal variance.
Answer: The residual plot seems to follow a fanning pattern along with having some under predicted outliers. To fix this issue, a transformation such as taking the log of the price, or adding a quadratic term may help the model.
Answer: According to the Residuals vs leverage plot, there are a few points with high leverage, but only one point, 223, that falls outside of cook’s distance and is considered influential.
Answer: No. This model accounts not only for cylinders, but several other predictor variables. If we didn’t know any other information about the car, it should not be used for this model. Additionally, since this model fails several assumptions, it should not be used even if we had all the information about the car.
Answer: Cars 222, 223, and 229 are the three most expensive cars according to suggested retail price, and are significantly higher priced than the other cars in the dataset. This leads to the model under-predicting on these cars because it is based primarily on other cheaper cars. If possible these cars should be looked into to determine if there is an external factor leading to their high price such as car brand.
Car #67 is at the extreme end of several predictors. It is one of three hybrid cars in the dataset, it is the only car with 3 cylinders which is the lowest amount for any car, it has the highest highway MPG, the lowest weight and the lowest horsepower. In many ways, it is not similar to the other cars in the dataset which gives it a high leverage score.
lm object
lmod1_removed.car_removed <- car[c(-67,-222,-223,-229),]
lmod1_removed = lm(SuggestedRetailPrice ~ EngineSize + Cylinders +
Horsepower + HighwayMPG + Weight + WheelBase + Hybrid, data=car_removed)
summary(lmod1_removed)
##
## Call:
## lm(formula = SuggestedRetailPrice ~ EngineSize + Cylinders +
## Horsepower + HighwayMPG + Weight + WheelBase + Hybrid, data = car_removed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14056.6 -3080.5 -92.3 3034.4 25015.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -72284.560 12627.077 -5.725 3.34e-08 ***
## EngineSize -8443.214 1313.408 -6.428 7.78e-10 ***
## Cylinders 3160.193 789.168 4.004 8.48e-05 ***
## Horsepower 154.619 13.155 11.753 < 2e-16 ***
## HighwayMPG 409.216 161.388 2.536 0.0119 *
## Weight 12.391 2.061 6.013 7.40e-09 ***
## WheelBase 228.739 140.926 1.623 0.1060
## Hybrid -3087.508 5058.179 -0.610 0.5422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5829 on 222 degrees of freedom
## Multiple R-squared: 0.8171, Adjusted R-squared: 0.8114
## F-statistic: 141.7 on 7 and 222 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod1_removed)
The data for this example come from a study of the effects of
childhood sexual abuse on adult females reported in Rodriguez et
al. (1997): 45 women treated at a clinic, who reported childhood sexual
abuse (csa), were measured for post-traumatic stress
disorder (ptsd) and childhood physical abuse
(cpa) both on standardized scales. Thirty-one women treated
at the same clinic, who did not report childhood sexual abuse, were also
measured.
Scatterplots between ptsd and cpa are shown
below. The points are the same between the left and the right panels. On
the left panel, a linear regression model with ptsd on
cpa is fitted to the entire data. On the right panel, the
same model is fitted to the two subgroups csa=1 and
csa=0 separately.
data(sexab,package="faraway")
library(ggplot2)
p1 = ggplot(data=sexab, aes(x = cpa, y = ptsd)) +
geom_point(aes(shape = csa)) + geom_smooth(method='lm') +
theme(legend.position="bottom")
p2 = ggplot(data=sexab, aes(x = cpa, y = ptsd)) +
geom_point(aes(shape = csa, color = csa)) + geom_smooth(method='lm', aes(group = csa, color = csa)) +
theme(legend.position="bottom")
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.5.2
grid.arrange(p1, p2, nrow = 1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
csa information is not
used in this regression model. Explain, intuitively, why one may still
want to use the csa information if the one on the left hand
side seems to be adequate?Answer: Even if the plot on the left seems to follow the regression line, there may be a better fit when differentiating by csa. Intuitively, it makes sense that someone is more likely to experience PTSD if they faced two types of abuse as opposed to one.
cpa should be different between the two subgroups? Why or
why not? You do not have to do rigorous test here. You simply need to
speak about your hunch.Answer: No I don’t think it should be different. While the lines for each subgroup do have a different slope, I don’t think it’s a large enough difference to claim that the true slope is different between subgroups, especially with only 76 observations.
csa. Whether you will include the
interaction term or not must be consistent with your answer in the last
question. Report the summary report.ptsd_model <- lm(ptsd ~ cpa + csa, data = sexab)
summary(ptsd_model)
##
## Call:
## lm(formula = ptsd ~ cpa + csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1567 -2.3643 -0.1533 2.1466 7.1417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.2480 0.7187 14.260 < 2e-16 ***
## cpa 0.5506 0.1716 3.209 0.00198 **
## csaNotAbused -6.2728 0.8219 -7.632 6.91e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.273 on 73 degrees of freedom
## Multiple R-squared: 0.5786, Adjusted R-squared: 0.5671
## F-statistic: 50.12 on 2 and 73 DF, p-value: 2.002e-14
anova, and report your
conclusion.ptsd_int_model <- lm(ptsd ~ cpa + csa + cpa:csa , data = sexab)
anova(ptsd_model, ptsd_int_model)
Conclusion: Fail to reject the null that the reduced model (without an interaction term) is a better fit since Pr(>F) > 0.05.
cpa:csaNotAbused, what conclusion can you make for this
t-test? What interesting observation can you make about the p-value
here?summary(ptsd_int_model)
##
## Call:
## lm(formula = ptsd ~ cpa + csa + cpa:csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1999 -2.5313 -0.1807 2.7744 6.9748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5571 0.8063 13.094 < 2e-16 ***
## cpa 0.4500 0.2085 2.159 0.0342 *
## csaNotAbused -6.8612 1.0747 -6.384 1.48e-08 ***
## cpa:csaNotAbused 0.3140 0.3685 0.852 0.3970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.279 on 72 degrees of freedom
## Multiple R-squared: 0.5828, Adjusted R-squared: 0.5654
## F-statistic: 33.53 on 3 and 72 DF, p-value: 1.133e-13
Answer: Since Pr(>|t|) > 0.05, we can conclude that the interaction term is not significant to the model. The p-value is the same as the p-value from the anova model that was conducted previously.
plot(ptsd_model)
Answer: The QQ-plot mostly follows a straight line, and does not suggest
a non-normal distribution from the errors. The residuals vs leverage
plot does not suggest any outliers/influential points. Both the
Scale-location plot and the residuals vs fitted plot show two separate
clusters of points, which may suggest unequal variance. However, since
we know that the data is split based on csa, which can be seen in the
initial scatter plot, I don’t think it is an issue. Additionally, the
points within each group appear to be random. Ultimately, I don’t see
anything wrong with these diagnostic plots.