Assignment M9

Name(s): Aiza K

Please open the hd data set in R.

hd <- read.csv("C:/Users/aizax94/Downloads/hd.csv")
   View(hd)

Obtain a model predicting SBP from Age and Smoke. Interpret the two slopes in this case. Interpet the value for r2 in this case

model <- lm(SBP ~ Age + Smoke, hd)
model

## 
## Call:
## lm(formula = SBP ~ Age + Smoke, data = hd)
## 
## Coefficients:
## (Intercept)          Age        Smoke  
##      62.389        1.464        7.882

summary(model)

## 
## Call:
## lm(formula = SBP ~ Age + Smoke, data = hd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.963  -5.596  -1.839   4.886  19.460 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  62.3892    11.8647   5.258 1.24e-05 ***
## Age           1.4639     0.2265   6.462 4.52e-07 ***
## Smoke         7.8818     3.1082   2.536   0.0169 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.507 on 29 degrees of freedom
## Multiple R-squared:  0.6734, Adjusted R-squared:  0.6508 
## F-statistic: 29.89 on 2 and 29 DF,  p-value: 8.992e-08

After adjusting for Age, for someone who smokes predicted SBP increases by 7.88. After adjusting for Smoke, for every 1 year increase in age predicted SBP increases by 1.46.

In this case, the value for r-squared is 0.67. This means 67% of the variability in SBP can be associated with the linear relationship between SBP with Age and Smoke.

(b)Use your model in part (a) to predict the mean SBP for 53-year-old non-smokers and the mean SBP for 65-year-old smokers. Obtain a 90% CI for the the mean SBP for 53-year-old non-smokers and for the mean SBP for 65-year-old smokers. Can you interpret this second interval?

k <- data.frame(Age = 53, Smoke = 0)
p <- predict(model, newdata = k, interval = "confidence", level = 0.9)
round(p, 2)

##      fit   lwr    upr
## 1 139.98 136.2 143.76

We can be 90% confident that the mean SBP for 53 year-old non- smokers lies between 136.2 and 143.8 mm.

k2 <- data.frame(Age = 65, Smoke = 1)
p2 <- predict(model, newdata = k2, interval = "confidence", level = 0.9)
round(p2, 2)

##      fit    lwr    upr
## 1 165.43 160.17 170.68

We can be 90% confident that the mean SBP for 65 year-old smokers lies between 160.2 and 170.7 mm.

1. What is the correlation coefficient (r) between SBP and BMI? (ii) Use a t test to check whether it is worth adding the variable BMI to a model predicting SBP from Age and Smoke. What do you conclude? (iii) Can you explain the apparent contradiction between your results in part (i) and part (ii)?

cor(hd$SBP, hd$BMI)

## [1] 0.7409754

model2 <- lm(SBP ~ Age + Smoke + BMI, data = hd)
summary(model2)

## 
## Call:
## lm(formula = SBP ~ Age + Smoke + BMI, data = hd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.243  -5.609  -1.536   5.602  21.203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  48.0815    14.8681   3.234  0.00313 **
## Age           1.0287     0.3594   2.862  0.00787 **
## Smoke         7.1385     3.0758   2.321  0.02780 * 
## BMI           1.4726     0.9579   1.537  0.13542   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.314 on 28 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.6665 
## F-statistic: 21.65 on 3 and 28 DF,  p-value: 1.859e-07

(i) The correlation coefficient between SBP and BMI is 0.74. (ii) In this case, the p-value with the inclusion of BMI to a model predicting SBP from Age and Smoke is 0.14. This would suggest that at the 5% level of significance, we cannot quite reject the null hypothesis (the population slope associated with BMI in a model with Age + Smoke is not significantly different from 0). In this case, the data suggests that adding BMI to the model is not very useful. (iii) The correlation found in part (i) does not account for Age or Smoke, it is just a measure of the strength of the relationship between SBP and BMI alone.

For your model in part (a) obtain a histogram of the residuals with a superimposed Normal curve to check the Normality condition. Finally, obtain a plot of the residuals against Age in order to check the linearity and the equal standard deviation conditions. Outline your conclusions.

res <- resid(model)
fit <- fitted(model)
hist(res, col = "blue")

plot(res ~ fit)
abline(h= 0, col = "green")

This histogram shows there is a possible outlier to the right (15-20 res) and a slight skewness, but overall the shape is relatively normal. The normality condition is met.

The pattern of the scatter in the points on the scatterplot is almost random. The spread is nearly uniform is we remove outlier points from 15-20. From this, the linearity and standard deviation conditions are met.

The data set StateEducation.sav contains the following data for each of the 50 states.

State
Expenditure Expenditure per student in K-12 (in $1000s)
STRatio Average student/teacher ratio in K-12 AveSalary Average salary of K-12 teachers (in $1000s) PartRate Percentage of eligible students taking the SATs
SATV Average SAT Verbal score for the state
SATM Average SAT Math score for the state SATT Average Total (SATM + SATV) for the state

For the purpose of this question, SATT is the response variable. We are interested in the relationship between SATT and expenditure. Open this data set.

StateEducation <- read.csv("C:/Users/aizax94/Downloads/StateEducation.CSV")
 View(StateEducation)

Plot SATT against Expenditure. What does the plot suggest about the relationship between these two variables?

plot(SATT ~ Expenditure, StateEducation, main = "Average SATT scores by State Expenditure")

From this plot, the relationship between SATT scores and state Expenditure is negative, linear, and very weak.

Regress SATT on expenditure. Interpret the slope in this case. Is your interpretation consistent with your answer to part (a)? Perform the appropriate t test.

model3 <- lm(SATT ~ Expenditure, StateEducation)
model3

## 
## Call:
## lm(formula = SATT ~ Expenditure, data = StateEducation)
## 
## Coefficients:
## (Intercept)  Expenditure  
##  1208.64334     -0.02089

For every 100 dollar decrease in state expenditure, the predicted SATT score for that state decreases by 2.1 points (or with every 1 dollar expenditure increase, predicted SATT decreases by 0.021 points). This is consistent with part (a) as the relationship is negative and not very strong.

confint(model3)

##                     2.5 %        97.5 %
## (Intercept) 1036.31832770  1.380968e+03
## Expenditure   -0.03562497 -6.158326e-03

We can be 95% confident he values for β1 lie within the range -0.036 and -6.16. This interval does not include 0, so it implies that β1 is different from 0.

As a layperson, do your findings in parts (a) and (b) surprise you?

Yes, these results are surprising. I would expect overall total SAT scores to be higher for states that have higher expenditures on education.

Regress SATT on expenditure and participation rate. Has the coefficient of determination increased from your model in part (b)? By how much?

model4 <- lm(SATT ~ Expenditure + PartRate, StateEducation)
summary(model4)

## 
## Call:
## lm(formula = SATT ~ Expenditure + PartRate, data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.400 -22.877   1.967  19.142  68.753 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 923.652790  45.233707  20.420  < 2e-16 ***
## Expenditure   0.012285   0.004224   2.908  0.00553 ** 
## PartRate     -2.850923   0.215116 -13.253  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared:  0.8195, Adjusted R-squared:  0.8118 
## F-statistic: 106.7 on 2 and 47 DF,  p-value: < 2.2e-16

summary(model3)

## 
## Call:
## lm(formula = SATT ~ Expenditure, data = StateEducation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.072  -46.820    4.087   40.035  128.490 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.209e+03  8.571e+01  14.102  < 2e-16 ***
## Expenditure -2.089e-02  7.328e-03  -2.851  0.00641 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.91 on 48 degrees of freedom
## Multiple R-squared:  0.1448, Adjusted R-squared:  0.127 
## F-statistic: 8.129 on 1 and 48 DF,  p-value: 0.006406

For the model in part (b), the coefficient of determination is 0.15. For the new model including participation rate, the coefficient of determination is 0.82. This is a difference of 0.67.

Interpret the slope associated with expenditure in this case.

After adjusting for participation rate, for every 100 dollar increase in State expenditure, predicted SATT scores increase by 1.2 points. (every 1 dollar increase in State expenditure causes predicted SATT to increase by 0.012 points)

Can you explain the apparent contradiction between your interpretation in part (b) and your interpretation in part (e)?

In part b, we are not accounting for the participation rate, the percentage of eligible students taking the SAT’s, when finding the relationship between SATT scores and Expenditure. This made it seem like higher expenditures tended to have lower SATT scores.

Open the data set StateEducation.sav again. Again, the response variable is SATT. Your task is to use the Goldman Procedure to arrive at a model predicting SATT from the potential predictor variables, Expenditure, STRatio, AveSalary, and PartRate. There is no need to include all the SPSS output but do present a summary table similar to the one in my notes.

You are to make use some of the of the variables Expenditure, STRatio, AveSalary, and PartRate to construct a model predicting SATT.

Verify that PartRate is the best single predictor of SATT. How much of the variability in SATT can be associated with PartRate? What is the p-value associated with the t test of the the null hypothesis that the population slope for PartRate is 0. What do you conclude?

model5 <- lm(SATT ~ PartRate + Expenditure + STRatio + AveSalary, StateEducation)
summary(model5)

## 
## Call:
## lm(formula = SATT ~ PartRate + Expenditure + STRatio + AveSalary, 
##     data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.530 -20.856  -1.747  15.982  66.572 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 997.459240  79.109415  12.609 2.27e-16 ***
## PartRate     -2.904491   0.231259 -12.559 2.61e-16 ***
## Expenditure   0.004457   0.010547   0.423    0.675    
## STRatio      -3.625234   3.215764  -1.127    0.266    
## AveSalary     0.001411   0.002056   0.687    0.496    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.809 
## F-statistic: 52.87 on 4 and 45 DF,  p-value: < 2.2e-16

model6 <- lm(SATT ~ PartRate, StateEducation)
summary(model6)

## 
## Call:
## lm(formula = SATT ~ PartRate, data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.158 -27.364   3.308  19.876  66.080 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1053.3204     8.2112  128.28   <2e-16 ***
## PartRate      -2.4801     0.1862  -13.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.89 on 48 degrees of freedom
## Multiple R-squared:  0.787,  Adjusted R-squared:  0.7825 
## F-statistic: 177.3 on 1 and 48 DF,  p-value: < 2.2e-16

Compared to the other possible predictors of SATT, PartRate has the lowest p-value. Model 6 shows that 79% of the variability in SATT can be associated with PartRate. I would conclude that since the p value is significantly less than 0.01, we can reject the null hypothesis at the 1% level of significance. This means the population slope associated with PartRate in a model is not very significantly different from 0.

Regress SATT on the three pairs of variables (i) PartRate and Expenditure, (ii) PartRate and SFRatio, and (iii) PartRate and AveSal. Show that the first pair have the highest R2. What is that value? Perform the t test of the benefit of adding Expenditure to a model with PartRate alone. What do you conclude?

model7 <- lm(SATT ~ PartRate + Expenditure, StateEducation)
summary(model7)

## 
## Call:
## lm(formula = SATT ~ PartRate + Expenditure, data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.400 -22.877   1.967  19.142  68.753 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 923.652790  45.233707  20.420  < 2e-16 ***
## PartRate     -2.850923   0.215116 -13.253  < 2e-16 ***
## Expenditure   0.012285   0.004224   2.908  0.00553 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared:  0.8195, Adjusted R-squared:  0.8118 
## F-statistic: 106.7 on 2 and 47 DF,  p-value: < 2.2e-16

model8 <- lm(SATT ~ PartRate + STRatio, StateEducation)
summary(model8)

## 
## Call:
## lm(formula = SATT ~ PartRate + STRatio, data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.053 -23.427   3.057  18.332  58.242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1118.5087    39.4733  28.336   <2e-16 ***
## PartRate      -2.5474     0.1871 -13.618   <2e-16 ***
## STRatio       -3.7264     2.2089  -1.687   0.0982 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.24 on 47 degrees of freedom
## Multiple R-squared:  0.7991, Adjusted R-squared:  0.7906 
## F-statistic:  93.5 on 2 and 47 DF,  p-value: < 2.2e-16

model9 <- lm(SATT ~ PartRate + AveSalary, StateEducation)
summary(model9)

## 
## Call:
## lm(formula = SATT ~ PartRate + AveSalary, data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.314 -26.731   3.168  18.951  75.590 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.572e+02  4.604e+01  20.790   <2e-16 ***
## PartRate    -2.779e+00  2.285e-01 -12.163    4e-16 ***
## AveSalary    1.878e-03  8.861e-04   2.119   0.0394 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.69 on 47 degrees of freedom
## Multiple R-squared:  0.8056, Adjusted R-squared:  0.7973 
## F-statistic: 97.36 on 2 and 47 DF,  p-value: < 2.2e-16

Model 7, SATT regressed on PartRate and Expenditure, has an R2 value of 0.8195. The p-value is 0.00553, which suggests that the population slope associated with Expenditure in a model with PartRate is significantly different from 0. Although the p-value is still low, it increased from the p-value in a model with just PartRate alone. This means that it is not entirely beneficial to add Expenditure to a model with PartRate.

Regress SATT on the two triples of variables, (i) PartRate, Expenditure, and SFRatio, and (ii) PartRate, Expenditure, and AveSal. Show that the first triple have the highest R2. What is the value Perform the t test of the benefit of adding SFRatio to a model with PartRate and Expenditure. What do you conclude?

model10 <- lm(SATT ~ PartRate + Expenditure + STRatio, StateEducation)
summary(model10)

## 
## Call:
## lm(formula = SATT ~ PartRate + Expenditure + STRatio, data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -92.284 -21.124   1.418  16.709  66.073 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 972.560888  69.900796  13.913   <2e-16 ***
## PartRate     -2.849108   0.215482 -13.222   <2e-16 ***
## Expenditure   0.011013   0.004452   2.474   0.0171 *  
## STRatio      -2.027994   2.207124  -0.919   0.3630    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.51 on 46 degrees of freedom
## Multiple R-squared:  0.8227, Adjusted R-squared:  0.8112 
## F-statistic: 71.16 on 3 and 46 DF,  p-value: < 2.2e-16

model11 <- lm(SATT ~ PartRate + Expenditure + AveSalary, StateEducation)
summary(model11)

## 
## Call:
## lm(formula = SATT ~ PartRate + Expenditure + AveSalary, data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -89.305 -21.775   2.109  19.157  68.389 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.262e+02  4.773e+01  19.407   <2e-16 ***
## PartRate    -2.840e+00  2.248e-01 -12.635   <2e-16 ***
## Expenditure  1.333e-02  7.042e-03   1.893   0.0647 .  
## AveSalary   -2.654e-04  1.423e-03  -0.186   0.8529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.8 on 46 degrees of freedom
## Multiple R-squared:  0.8196, Adjusted R-squared:  0.8078 
## F-statistic: 69.67 on 3 and 46 DF,  p-value: < 2.2e-16

Model 10, with the variables PartRate, Expenditure, and SFRatio, have a R2 value of 0.8227, which is greater than model 11. The p-value associated with ST Ratio is 0.360, which means the population slope associated with STRatio in a model with Expenditure and PartRate is not (quite) significantly different from 0. In this case, adding STRatio to a model with PartRate and Expenditure is probably not worthwhile.

For the two variable model predicting SATT from PartRate and Expenditure, obtain and interpret the residual standard error.

summary(model7)

## 
## Call:
## lm(formula = SATT ~ PartRate + Expenditure, data = StateEducation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.400 -22.877   1.967  19.142  68.753 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 923.652790  45.233707  20.420  < 2e-16 ***
## PartRate     -2.850923   0.215116 -13.253  < 2e-16 ***
## Expenditure   0.012285   0.004224   2.908  0.00553 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared:  0.8195, Adjusted R-squared:  0.8118 
## F-statistic: 106.7 on 2 and 47 DF,  p-value: < 2.2e-16

The residual standard error for this model is 32.46. This implies that a typical SATT score is roughly 32.46 points from the regression line.

Assignment M9

Aiza K

April 6, 2016