Name(s): Aiza K
hd <- read.csv("C:/Users/aizax94/Downloads/hd.csv")
View(hd)
model <- lm(SBP ~ Age + Smoke, hd)
model
##
## Call:
## lm(formula = SBP ~ Age + Smoke, data = hd)
##
## Coefficients:
## (Intercept) Age Smoke
## 62.389 1.464 7.882
summary(model)
##
## Call:
## lm(formula = SBP ~ Age + Smoke, data = hd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.963 -5.596 -1.839 4.886 19.460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.3892 11.8647 5.258 1.24e-05 ***
## Age 1.4639 0.2265 6.462 4.52e-07 ***
## Smoke 7.8818 3.1082 2.536 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.507 on 29 degrees of freedom
## Multiple R-squared: 0.6734, Adjusted R-squared: 0.6508
## F-statistic: 29.89 on 2 and 29 DF, p-value: 8.992e-08
After adjusting for Age, for someone who smokes predicted SBP increases by 7.88. After adjusting for Smoke, for every 1 year increase in age predicted SBP increases by 1.46.
In this case, the value for r-squared is 0.67. This means 67% of the variability in SBP can be associated with the linear relationship between SBP with Age and Smoke.
(b)Use your model in part (a) to predict the mean SBP for 53-year-old non-smokers and the mean SBP for 65-year-old smokers. Obtain a 90% CI for the the mean SBP for 53-year-old non-smokers and for the mean SBP for 65-year-old smokers. Can you interpret this second interval?
k <- data.frame(Age = 53, Smoke = 0)
p <- predict(model, newdata = k, interval = "confidence", level = 0.9)
round(p, 2)
## fit lwr upr
## 1 139.98 136.2 143.76
We can be 90% confident that the mean SBP for 53 year-old non- smokers lies between 136.2 and 143.8 mm.
k2 <- data.frame(Age = 65, Smoke = 1)
p2 <- predict(model, newdata = k2, interval = "confidence", level = 0.9)
round(p2, 2)
## fit lwr upr
## 1 165.43 160.17 170.68
We can be 90% confident that the mean SBP for 65 year-old smokers lies between 160.2 and 170.7 mm.
cor(hd$SBP, hd$BMI)
## [1] 0.7409754
model2 <- lm(SBP ~ Age + Smoke + BMI, data = hd)
summary(model2)
##
## Call:
## lm(formula = SBP ~ Age + Smoke + BMI, data = hd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.243 -5.609 -1.536 5.602 21.203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.0815 14.8681 3.234 0.00313 **
## Age 1.0287 0.3594 2.862 0.00787 **
## Smoke 7.1385 3.0758 2.321 0.02780 *
## BMI 1.4726 0.9579 1.537 0.13542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.314 on 28 degrees of freedom
## Multiple R-squared: 0.6988, Adjusted R-squared: 0.6665
## F-statistic: 21.65 on 3 and 28 DF, p-value: 1.859e-07
(i) The correlation coefficient between SBP and BMI is 0.74. (ii) In this case, the p-value with the inclusion of BMI to a model predicting SBP from Age and Smoke is 0.14. This would suggest that at the 5% level of significance, we cannot quite reject the null hypothesis (the population slope associated with BMI in a model with Age + Smoke is not significantly different from 0). In this case, the data suggests that adding BMI to the model is not very useful. (iii) The correlation found in part (i) does not account for Age or Smoke, it is just a measure of the strength of the relationship between SBP and BMI alone.
res <- resid(model)
fit <- fitted(model)
hist(res, col = "blue")
plot(res ~ fit)
abline(h= 0, col = "green")
This histogram shows there is a possible outlier to the right (15-20 res) and a slight skewness, but overall the shape is relatively normal. The normality condition is met.
The pattern of the scatter in the points on the scatterplot is almost random. The spread is nearly uniform is we remove outlier points from 15-20. From this, the linearity and standard deviation conditions are met.
State
Expenditure Expenditure per student in K-12 (in $1000s)
STRatio Average student/teacher ratio in K-12 AveSalary Average salary of K-12 teachers (in $1000s) PartRate Percentage of eligible students taking the SATs
SATV Average SAT Verbal score for the state
SATM Average SAT Math score for the state SATT Average Total (SATM + SATV) for the state
For the purpose of this question, SATT is the response variable. We are interested in the relationship between SATT and expenditure. Open this data set.
StateEducation <- read.csv("C:/Users/aizax94/Downloads/StateEducation.CSV")
View(StateEducation)
plot(SATT ~ Expenditure, StateEducation, main = "Average SATT scores by State Expenditure")
From this plot, the relationship between SATT scores and state Expenditure is negative, linear, and very weak.
model3 <- lm(SATT ~ Expenditure, StateEducation)
model3
##
## Call:
## lm(formula = SATT ~ Expenditure, data = StateEducation)
##
## Coefficients:
## (Intercept) Expenditure
## 1208.64334 -0.02089
For every 100 dollar decrease in state expenditure, the predicted SATT score for that state decreases by 2.1 points (or with every 1 dollar expenditure increase, predicted SATT decreases by 0.021 points). This is consistent with part (a) as the relationship is negative and not very strong.
confint(model3)
## 2.5 % 97.5 %
## (Intercept) 1036.31832770 1.380968e+03
## Expenditure -0.03562497 -6.158326e-03
We can be 95% confident he values for β1 lie within the range -0.036 and -6.16. This interval does not include 0, so it implies that β1 is different from 0.
Yes, these results are surprising. I would expect overall total SAT scores to be higher for states that have higher expenditures on education.
model4 <- lm(SATT ~ Expenditure + PartRate, StateEducation)
summary(model4)
##
## Call:
## lm(formula = SATT ~ Expenditure + PartRate, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.400 -22.877 1.967 19.142 68.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 923.652790 45.233707 20.420 < 2e-16 ***
## Expenditure 0.012285 0.004224 2.908 0.00553 **
## PartRate -2.850923 0.215116 -13.253 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
## F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16
summary(model3)
##
## Call:
## lm(formula = SATT ~ Expenditure, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.072 -46.820 4.087 40.035 128.490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.209e+03 8.571e+01 14.102 < 2e-16 ***
## Expenditure -2.089e-02 7.328e-03 -2.851 0.00641 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.91 on 48 degrees of freedom
## Multiple R-squared: 0.1448, Adjusted R-squared: 0.127
## F-statistic: 8.129 on 1 and 48 DF, p-value: 0.006406
For the model in part (b), the coefficient of determination is 0.15. For the new model including participation rate, the coefficient of determination is 0.82. This is a difference of 0.67.
After adjusting for participation rate, for every 100 dollar increase in State expenditure, predicted SATT scores increase by 1.2 points. (every 1 dollar increase in State expenditure causes predicted SATT to increase by 0.012 points)
In part b, we are not accounting for the participation rate, the percentage of eligible students taking the SAT’s, when finding the relationship between SATT scores and Expenditure. This made it seem like higher expenditures tended to have lower SATT scores.
You are to make use some of the of the variables Expenditure, STRatio, AveSalary, and PartRate to construct a model predicting SATT.
model5 <- lm(SATT ~ PartRate + Expenditure + STRatio + AveSalary, StateEducation)
summary(model5)
##
## Call:
## lm(formula = SATT ~ PartRate + Expenditure + STRatio + AveSalary,
## data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.530 -20.856 -1.747 15.982 66.572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 997.459240 79.109415 12.609 2.27e-16 ***
## PartRate -2.904491 0.231259 -12.559 2.61e-16 ***
## Expenditure 0.004457 0.010547 0.423 0.675
## STRatio -3.625234 3.215764 -1.127 0.266
## AveSalary 0.001411 0.002056 0.687 0.496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.87 on 4 and 45 DF, p-value: < 2.2e-16
model6 <- lm(SATT ~ PartRate, StateEducation)
summary(model6)
##
## Call:
## lm(formula = SATT ~ PartRate, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.158 -27.364 3.308 19.876 66.080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1053.3204 8.2112 128.28 <2e-16 ***
## PartRate -2.4801 0.1862 -13.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.89 on 48 degrees of freedom
## Multiple R-squared: 0.787, Adjusted R-squared: 0.7825
## F-statistic: 177.3 on 1 and 48 DF, p-value: < 2.2e-16
Compared to the other possible predictors of SATT, PartRate has the lowest p-value. Model 6 shows that 79% of the variability in SATT can be associated with PartRate. I would conclude that since the p value is significantly less than 0.01, we can reject the null hypothesis at the 1% level of significance. This means the population slope associated with PartRate in a model is not very significantly different from 0.
model7 <- lm(SATT ~ PartRate + Expenditure, StateEducation)
summary(model7)
##
## Call:
## lm(formula = SATT ~ PartRate + Expenditure, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.400 -22.877 1.967 19.142 68.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 923.652790 45.233707 20.420 < 2e-16 ***
## PartRate -2.850923 0.215116 -13.253 < 2e-16 ***
## Expenditure 0.012285 0.004224 2.908 0.00553 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
## F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16
model8 <- lm(SATT ~ PartRate + STRatio, StateEducation)
summary(model8)
##
## Call:
## lm(formula = SATT ~ PartRate + STRatio, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.053 -23.427 3.057 18.332 58.242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1118.5087 39.4733 28.336 <2e-16 ***
## PartRate -2.5474 0.1871 -13.618 <2e-16 ***
## STRatio -3.7264 2.2089 -1.687 0.0982 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.24 on 47 degrees of freedom
## Multiple R-squared: 0.7991, Adjusted R-squared: 0.7906
## F-statistic: 93.5 on 2 and 47 DF, p-value: < 2.2e-16
model9 <- lm(SATT ~ PartRate + AveSalary, StateEducation)
summary(model9)
##
## Call:
## lm(formula = SATT ~ PartRate + AveSalary, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.314 -26.731 3.168 18.951 75.590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.572e+02 4.604e+01 20.790 <2e-16 ***
## PartRate -2.779e+00 2.285e-01 -12.163 4e-16 ***
## AveSalary 1.878e-03 8.861e-04 2.119 0.0394 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.69 on 47 degrees of freedom
## Multiple R-squared: 0.8056, Adjusted R-squared: 0.7973
## F-statistic: 97.36 on 2 and 47 DF, p-value: < 2.2e-16
Model 7, SATT regressed on PartRate and Expenditure, has an R2 value of 0.8195. The p-value is 0.00553, which suggests that the population slope associated with Expenditure in a model with PartRate is significantly different from 0. Although the p-value is still low, it increased from the p-value in a model with just PartRate alone. This means that it is not entirely beneficial to add Expenditure to a model with PartRate.
model10 <- lm(SATT ~ PartRate + Expenditure + STRatio, StateEducation)
summary(model10)
##
## Call:
## lm(formula = SATT ~ PartRate + Expenditure + STRatio, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.284 -21.124 1.418 16.709 66.073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 972.560888 69.900796 13.913 <2e-16 ***
## PartRate -2.849108 0.215482 -13.222 <2e-16 ***
## Expenditure 0.011013 0.004452 2.474 0.0171 *
## STRatio -2.027994 2.207124 -0.919 0.3630
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.51 on 46 degrees of freedom
## Multiple R-squared: 0.8227, Adjusted R-squared: 0.8112
## F-statistic: 71.16 on 3 and 46 DF, p-value: < 2.2e-16
model11 <- lm(SATT ~ PartRate + Expenditure + AveSalary, StateEducation)
summary(model11)
##
## Call:
## lm(formula = SATT ~ PartRate + Expenditure + AveSalary, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.305 -21.775 2.109 19.157 68.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.262e+02 4.773e+01 19.407 <2e-16 ***
## PartRate -2.840e+00 2.248e-01 -12.635 <2e-16 ***
## Expenditure 1.333e-02 7.042e-03 1.893 0.0647 .
## AveSalary -2.654e-04 1.423e-03 -0.186 0.8529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.8 on 46 degrees of freedom
## Multiple R-squared: 0.8196, Adjusted R-squared: 0.8078
## F-statistic: 69.67 on 3 and 46 DF, p-value: < 2.2e-16
Model 10, with the variables PartRate, Expenditure, and SFRatio, have a R2 value of 0.8227, which is greater than model 11. The p-value associated with ST Ratio is 0.360, which means the population slope associated with STRatio in a model with Expenditure and PartRate is not (quite) significantly different from 0. In this case, adding STRatio to a model with PartRate and Expenditure is probably not worthwhile.
summary(model7)
##
## Call:
## lm(formula = SATT ~ PartRate + Expenditure, data = StateEducation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.400 -22.877 1.967 19.142 68.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 923.652790 45.233707 20.420 < 2e-16 ***
## PartRate -2.850923 0.215116 -13.253 < 2e-16 ***
## Expenditure 0.012285 0.004224 2.908 0.00553 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
## F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16
The residual standard error for this model is 32.46. This implies that a typical SATT score is roughly 32.46 points from the regression line.