The data gives the systolic blood pressure (SBP), body size (QUET), age (AGE), and smoking history (SMK = 0 if nonsmoker, SMK = 1 if a current or previous smoker) for a hypothetical sample of 32 white males over 40 years old from the town of Angina.
data = read.csv("week2-HW-data.csv", header = TRUE, sep = ",", row.names = 1)
attach(data)
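A quick look at the imported data confirms the sample size and variable types before plotting (a minimal check; output omitted):
str(data)     # expect 32 obs. of 4 variables: SBP, QUET, AGE, SMK
summary(data) # summary statistics for each column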
library(ggplot2)
library(gridExtra)
library(ggthemes)
p1 = ggplot(data, aes(QUET, SBP)) + geom_point() + ggtitle("SBP (Y) vs. QUET (X)") +
theme_stata() + scale_color_stata()
p2 = ggplot(data, aes(SMK, SBP)) + geom_point() + ggtitle("SBP (Y) vs. SMK (X)") +
theme_stata() + scale_color_stata()
p3 = ggplot(data, aes(AGE, QUET)) + geom_point() + ggtitle("QUET (Y) vs. AGE (X)") +
theme_stata() + scale_color_stata()
p4 = ggplot(data, aes(AGE, SBP)) + geom_point() + ggtitle("SBP (Y) vs. AGE (X)") +
theme_stata() + scale_color_stata()
marrangeGrob(list(p1, p2, p3, p4), nrow = 2, ncol = 2, top = " ")
Using scatter diagrams #1, #3, and #4 that you generated above, use paper and pencil to roughly sketch a line that fits the data reasonably well. Use the homework forum to share your sketches and comment on the relationships described.
p1 = ggplot(data, aes(QUET, SBP)) + geom_point() + ggtitle("SBP (Y) vs. QUET (X)") +
geom_smooth(method = "lm") + theme_stata() + scale_color_stata()
p3 = ggplot(data, aes(AGE, QUET)) + geom_point() + ggtitle("QUET (Y) vs. AGE (X)") +
geom_smooth(method = "lm") + theme_stata() + scale_color_stata()
p4 = ggplot(data, aes(AGE, SBP)) + geom_point() + ggtitle("SBP (Y) vs. AGE (X)") +
geom_smooth(method = "lm") + theme_stata() + scale_color_stata()
marrangeGrob(list(p1, p3, p4), nrow = 2, ncol = 2, top = " ")
We can see possible positive linear relationships in “SBP (Y) vs. QUET (X)”, “QUET (Y) vs. AGE (X)”, and “SBP (Y) vs. AGE (X)”, but not in “SBP (Y) vs. SMK (X)”: a scatter plot is not an appropriate tool for a binary predictor such as SMK, so we use a boxplot instead.
data = transform(data, SMK = factor(SMK, levels = 0:1, labels = c("Nonsmoker", "Smoker")))
ggplot(data, aes(SMK, SBP)) + geom_boxplot() + ggtitle("SBP (Y) vs. SMK (X)") +
theme_stata() + scale_color_stata()
Fitting the simple linear regression of SBP on SMK in R, we obtain:
y = lm(SBP ~ SMK)  # SMK resolves to the attached numeric 0/1 column, not the relabeled factor
summary(y)
##
## Call:
## lm(formula = SBP ~ SMK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.824 -9.056 -2.812 11.200 32.176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140.800 3.661 38.454 <2e-16 ***
## SMK 7.024 5.023 1.398 0.172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.18 on 30 degrees of freedom
## Multiple R-squared: 0.06117, Adjusted R-squared: 0.02988
## F-statistic: 1.955 on 1 and 30 DF, p-value: 0.1723
So, the least squares estimates for these parameters are:
\[\hat y = 140.800 + 7.024\times SMK\]
or \(\hat \beta_1 = 7.024\) and intercept \(\hat \beta_0 = 140.800\).
In this case, we can say that under this model smokers have a higher average SBP than nonsmokers (about 7.024 mm Hg higher).
library(reshape2)
data1 = melt(data, id = "SMK", measure.vars = "SBP")  # reshape to long format
aggregate(value ~ SMK, data1, mean)                   # mean SBP by smoking group
## SMK value
## 1 Nonsmoker 140.8000
## 2 Smoker 147.8235
As expected, the value of \(\hat \beta_0\) is the average SBP for nonsmokers and \(\hat \beta_0 + \hat \beta_1\) is the average SBP for smokers. These results are consistent with the answer to question (1) above.
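We can recover these group means directly from the fitted coefficients (a minimal sketch; y is still the SMK model fitted above):
b = coef(y)  # b[1] = intercept, b[2] = SMK slope
c(Nonsmoker = unname(b[1]), Smoker = unname(b[1] + b[2]))  # should reproduce 140.800 and 147.824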
anova(y)
## Analysis of Variance Table
##
## Response: SBP
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK 1 393.1 393.1 1.9548 0.1723
## Residuals 30 6032.9 201.1
At the 5% significance level, we cannot reject the null hypothesis (\(H_0: \beta_1 = 0\)): rejecting it would carry a 17.23% probability of a Type I error (the p-value), well above the 5% we allow.
t.test(SBP ~ SMK, data = data, alternative = "two.sided",
mu = 0, paired = FALSE, var.equal = TRUE, conf.level = 0.95)
##
## Two Sample t-test
##
## data: SBP by SMK
## t = -1.3981, df = 30, p-value = 0.1723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -17.282882 3.235823
## sample estimates:
## mean in group Nonsmoker mean in group Smoker
## 140.8000 147.8235
We can see that this t-test is equivalent to the F-test of question (3): the p-values agree (0.1723), and the squared t statistic equals the F statistic.
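A quick numeric check of this equivalence (a sketch; y is still the SMK model):
tval = summary(y)$coefficients["SMK", "t value"]   # t = 1.398
c(t_squared = tval^2, F = anova(y)[1, "F value"])  # both should equal 1.9548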
y = lm(SBP ~ QUET)
summary(y)
##
## Call:
## lm(formula = SBP ~ QUET)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.231 -7.145 -1.604 7.798 22.531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.576 12.322 5.728 2.99e-06 ***
## QUET 21.492 3.545 6.062 1.17e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.812 on 30 degrees of freedom
## Multiple R-squared: 0.5506, Adjusted R-squared: 0.5356
## F-statistic: 36.75 on 1 and 30 DF, p-value: 1.172e-06
So, the least squares estimates for these parameters are:
\[\hat y = 70.576 + 21.492\times QUET\]
or \(\hat \beta_1 = 21.492\) and intercept \(\hat \beta_0 = 70.576\).
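As a cross-check, the closed-form least squares formulas reproduce these estimates (a minimal sketch using the attached columns):
b1 = cov(QUET, SBP) / var(QUET)   # beta1_hat = Sxy / Sxx
b0 = mean(SBP) - b1 * mean(QUET)  # beta0_hat = ybar - beta1_hat * xbar
c(b0 = b0, b1 = b1)               # should match coef(y): 70.576 and 21.492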
ggplot(data, aes(QUET, SBP)) + geom_point() + ggtitle("SBP (Y) vs. QUET (X)") +
geom_smooth(method = "lm") + theme_stata() + scale_color_stata()
At the 5% significance level, we can reject the null hypothesis (\(H_0: \beta_1 = 0\)): the p-value (\(1.17 \times 10^{-6}\)) is far below the 5% we allow, so the probability of a Type I error in rejecting \(H_0\) is essentially zero.
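Equivalently, the 95% confidence interval for the slope excludes zero (a quick check; the bounds in the comment are computed from the estimates above):
confint(y)  # the QUET row should be roughly (14.25, 28.73), excluding 0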
We are 95% confident that the true mean SBP at the average observed QUET (\(\bar X = 3.441\)) is between \(140.989\) and \(148.074\):
coef = summary(y)$coefficients
coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.57640 12.321868 5.727736 2.990959e-06
## QUET 21.49167 3.545147 6.062279 1.171656e-06
ymean = coef[1,1] + coef[2,1] * 3.441094  # fitted mean at QUET = 3.441094 = mean(QUET)
ymean
## [1] 144.5313
summary(y)$sigma
## [1] 9.811597
length(QUET)
## [1] 32
s = summary(y)$sigma * sqrt(1/length(QUET))  # SE of the fitted mean at x = xbar
s
## [1] 1.734462
y$df.residual
## [1] 30
ymean + c(-1, 1) * qt(.975, df = y$df.residual) * s
## [1] 140.9890 148.0735
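The same interval can be obtained directly from predict() evaluated at the sample mean of QUET (a quick cross-check):
predict(y, newdata = data.frame(QUET = mean(QUET)), interval = "confidence")  # should reproduce 140.989 and 148.074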
newQUET = data.frame(QUET = seq(min(QUET), max(QUET), length = 100))
p1 = data.frame(predict(y, newdata = newQUET, interval = "confidence"))
p2 = data.frame(predict(y, newdata = newQUET, interval = "prediction"))
p1$Interval = "confidence"
p2$Interval = "prediction"
p1$x = newQUET$QUET
p2$x = newQUET$QUET
dat = rbind(p1, p2)
names(dat)[1] = "y"
ggplot(dat, aes(x = x, y = y)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = Interval), alpha = 0.2) +
geom_line() +
geom_point(data = data, aes(x = QUET, y = SBP), size = 4) +
theme_stata() +
scale_color_stata()
Yes: the scatter diagram and the fitted line suggest a positive linear association between QUET and SBP.
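The sample correlation quantifies this association; its square equals the model's R-squared (a quick check):
r = cor(QUET, SBP)
c(r = r, r_squared = r^2)  # r^2 should match Multiple R-squared = 0.5506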
Plotting of the residuals:
par(mfrow = c(3,2))
plot(y, which = 1:6)
par(mfrow = c(1,1))
Tests:
library(car)
library(lmtest)
# Normality of the errors: H_0: residuals come from normal population
shapiro.test(resid(y))
##
## Shapiro-Wilk normality test
##
## data: resid(y)
## W = 0.97006, p-value = 0.5011
# Independence of the errors: H_0: rho = 0
durbinWatsonTest(y)
## lag Autocorrelation D-W Statistic p-value
## 1 0.3113661 1.356743 0.088
## Alternative hypothesis: rho != 0
# Tests for homoscedasticity errors: H_0: equal variances
bptest(y)
##
## studentized Breusch-Pagan test
##
## data: y
## BP = 0.24972, df = 1, p-value = 0.6173
# Outliers:
outlierTest(y)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 9 2.816464 0.0086454 0.27665
dfbeta(y) # change in coefficients when the i^th point is deleted in the model
## (Intercept) QUET
## 1 0.80379843 -0.20798584
## 2 -2.23196724 0.47485745
## 3 -1.39309620 0.33626507
## 4 0.43070094 -0.15899593
## 5 2.89391559 -0.73095381
## 6 -0.54519315 0.14314306
## 7 -0.92523518 0.38774645
## 8 -0.55656321 0.27275428
## 9 14.13235405 -3.85692084
## 10 -6.31947756 1.94989034
## 11 -2.10974677 0.72951813
## 12 4.87898754 -1.60703812
## 13 2.10437218 -0.68181444
## 14 0.87327000 -0.36251360
## 15 0.17010043 -0.11640067
## 16 -0.24625685 0.06185327
## 17 0.15464852 -0.02419104
## 18 1.48661066 -0.37026585
## 19 -0.59279072 0.13698784
## 20 -0.08491781 0.00902303
## 21 -0.07937857 0.03674212
## 22 0.81207538 -0.30429459
## 23 -0.44053069 0.08653431
## 24 -1.06235085 0.23729682
## 25 0.72913278 -0.14158741
## 26 -0.80162023 0.20012941
## 27 -3.73345105 0.98038048
## 28 -2.15321136 0.54725946
## 29 -1.19554192 0.43095570
## 30 -3.27046782 1.05681595
## 31 0.80956287 -0.27152170
## 32 -1.75348537 0.58055748
dffits(y) # standardized change in the fitted value when the i^th point is deleted from the model
## 1 2 3 4 5 6
## 0.07639412 -0.38811946 -0.16462184 -0.07955174 0.30255556 -0.04967946
## 7 8 9 10 11 12
## 0.26320953 0.23532693 1.32665983 0.59701921 0.31264791 -0.62348198
## 13 14 15 16 17 18
## -0.23578136 -0.24053612 -0.13580106 -0.02555904 0.04106457 0.15949527
## 19 20 21 22 23 24
## -0.07880610 -0.03065351 0.02856414 -0.15920602 -0.08470319 -0.15570088
## 25 26 27 28 29 30
## 0.14409553 -0.08491904 -0.34727101 -0.21822483 0.20493954 0.36713110
## 31 32
## -0.10357553 0.21447115
At the 5% significance level, we have no significant evidence to reject the null hypotheses (\(H_0\)) of normality, independence, and homoscedasticity of the residuals: all p-values exceed the 5% previously set. The scatter diagrams do not suggest a problem with linearity. The residual plots and the outlier diagnostics do, however, point to a possible problem with extreme points (observation 9).
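To gauge how much observation 9 actually matters, we could refit the model without it and compare the coefficients (a sketch; the data columns are still attached at this point):
cooks.distance(y)[9]              # influence of observation 9 on the fit
y9 = lm(SBP ~ QUET, subset = -9)  # hypothetical refit excluding observation 9
rbind(full = coef(y), without_9 = coef(y9))  # compare intercept and slope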
detach(data)