The data gives the systolic blood pressure (SBP), body size (QUET), age (AGE), and smoking history (SMK = 0 if nonsmoker, SMK = 1 if a current or previous smoker) for a hypothetical sample of 32 white males over 40 years old from the town of Angina.
data = read.csv("week2-HW-data.csv", header = TRUE, sep = ",", row.names = 1)
attach(data)
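A quick look at the imported data confirms the sample size and variable types before plotting (a minimal check; output omitted):
str(data)     # expect 32 obs. of 4 variables: SBP, QUET, AGE, SMK
summary(data) # summary statistics for each column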
library(ggplot2)
library(gridExtra)
library(ggthemes)
p1 = ggplot(data, aes(QUET, SBP)) + geom_point() + ggtitle("SBP (Y) vs. QUET (X)") +
theme_stata() + scale_color_stata()
p2 = ggplot(data, aes(SMK, SBP)) + geom_point() + ggtitle("SBP (Y) vs. SMK (X)") +
theme_stata() + scale_color_stata()
p3 = ggplot(data, aes(AGE, QUET)) + geom_point() + ggtitle("QUET (Y) vs. AGE (X)") +
theme_stata() + scale_color_stata()
p4 = ggplot(data, aes(AGE, SBP)) + geom_point() + ggtitle("SBP (Y) vs. AGE (X)") +
theme_stata() + scale_color_stata()
marrangeGrob(list(p1, p2, p3, p4), nrow = 2, ncol = 2, top = " ")
Using scatter diagrams #1, #3, and #4 that you generated above, use paper and pencil to roughly sketch a line that fits the data reasonably well. Use the homework forum to share your sketches and comment on the relationships described.
p1 = ggplot(data, aes(QUET, SBP)) + geom_point() + ggtitle("SBP (Y) vs. QUET (X)") +
geom_smooth(method = "lm") + theme_stata() + scale_color_stata()
p3 = ggplot(data, aes(AGE, QUET)) + geom_point() + ggtitle("QUET (Y) vs. AGE (X)") +
geom_smooth(method = "lm") + theme_stata() + scale_color_stata()
p4 = ggplot(data, aes(AGE, SBP)) + geom_point() + ggtitle("SBP (Y) vs. AGE (X)") +
geom_smooth(method = "lm") + theme_stata() + scale_color_stata()
marrangeGrob(list(p1, p3, p4), nrow = 2, ncol = 2, top = " ")
We can see possible positive linear relationships in “SBP (Y) vs. QUET (X)”, “QUET (Y) vs. AGE (X)”, and “SBP (Y) vs. AGE (X)”, but not in “SBP (Y) vs. SMK (X)”: a scatter plot is not an appropriate tool for a binary predictor such as SMK, so we use a boxplot instead.
data = transform(data, SMK = factor(SMK, levels = 0:1, labels = c("Nonsmoker", "Smoker")))
ggplot(data, aes(SMK, SBP)) + geom_boxplot() + ggtitle("SBP (Y) vs. SMK (X)") +
theme_stata() + scale_color_stata()
Fitting the simple linear regression of SBP on SMK in R, we obtain:
y = lm(SBP ~ SMK)  # SMK resolves to the attached numeric 0/1 column, not the relabeled factor
summary(y)
##
## Call:
## lm(formula = SBP ~ SMK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.824 -9.056 -2.812 11.200 32.176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140.800 3.661 38.454 <2e-16 ***
## SMK 7.024 5.023 1.398 0.172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.18 on 30 degrees of freedom
## Multiple R-squared: 0.06117, Adjusted R-squared: 0.02988
## F-statistic: 1.955 on 1 and 30 DF, p-value: 0.1723
So, the least squares estimates for these parameters are:
\[\hat y = 140.800 + 7.024\times SMK\]
or \(\hat \beta_1 = 7.024\) and intercept \(\hat \beta_0 = 140.800\).
In this case, we can say that under this model smokers have a higher average SBP than nonsmokers (about 7.024 mm Hg higher).
library(reshape2)
data1 = melt(data, id = "SMK", measure.vars = "SBP")  # reshape to long format
aggregate(value ~ SMK, data1, mean)                   # mean SBP by smoking group
## SMK value
## 1 Nonsmoker 140.8000
## 2 Smoker 147.8235
As expected, the value of \(\hat \beta_0\) is the average SBP for nonsmokers and \(\hat \beta_0 + \hat \beta_1\) is the average SBP for smokers. These results are consistent with the answer to question (1) above.
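We can recover these group means directly from the fitted coefficients (a minimal sketch; y is still the SMK model fitted above):
b = coef(y)  # b[1] = intercept, b[2] = SMK slope
c(Nonsmoker = unname(b[1]), Smoker = unname(b[1] + b[2]))  # should reproduce 140.800 and 147.824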
anova(y)
## Analysis of Variance Table
##
## Response: SBP
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK 1 393.1 393.1 1.9548 0.1723
## Residuals 30 6032.9 201.1
At the 5% significance level, we cannot reject the null hypothesis (\(H_0: \beta_1 = 0\)): rejecting it would carry a 17.23% probability of a Type I error (the p-value), well above the 5% we allow.
t.test(SBP ~ SMK, data = data, alternative = "two.sided",
mu = 0, paired = FALSE, var.equal = TRUE, conf.level = 0.95)
##
## Two Sample t-test
##
## data: SBP by SMK
## t = -1.3981, df = 30, p-value = 0.1723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -17.282882 3.235823
## sample estimates:
## mean in group Nonsmoker mean in group Smoker
## 140.8000 147.8235
We can see that this t-test is equivalent to the F-test of question (3): the p-values agree (0.1723), and the squared t statistic equals the F statistic.
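A quick numeric check of this equivalence (a sketch; y is still the SMK model):
tval = summary(y)$coefficients["SMK", "t value"]   # t = 1.398
c(t_squared = tval^2, F = anova(y)[1, "F value"])  # both should equal 1.9548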
y = lm(SBP ~ QUET)
summary(y)
##
## Call:
## lm(formula = SBP ~ QUET)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.231 -7.145 -1.604 7.798 22.531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.576 12.322 5.728 2.99e-06 ***
## QUET 21.492 3.545 6.062 1.17e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.812 on 30 degrees of freedom
## Multiple R-squared: 0.5506, Adjusted R-squared: 0.5356
## F-statistic: 36.75 on 1 and 30 DF, p-value: 1.172e-06
So, the least squares estimates for these parameters are:
\[\hat y = 70.576 + 21.492\times QUET\]
or \(\hat \beta_1 = 21.492\) and intercept \(\hat \beta_0 = 70.576\).
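As a cross-check, the closed-form least squares formulas reproduce these estimates (a minimal sketch using the attached columns):
b1 = cov(QUET, SBP) / var(QUET)   # beta1_hat = Sxy / Sxx
b0 = mean(SBP) - b1 * mean(QUET)  # beta0_hat = ybar - beta1_hat * xbar
c(b0 = b0, b1 = b1)               # should match coef(y): 70.576 and 21.492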
ggplot(data, aes(QUET, SBP)) + geom_point() + ggtitle("SBP (Y) vs. QUET (X)") +
geom_smooth(method = "lm") + theme_stata() + scale_color_stata()
At the 5% significance level, we can reject the null hypothesis (\(H_0: \beta_1 = 0\)): the p-value (\(1.17 \times 10^{-6}\)) is far below the 5% we allow, so the probability of a Type I error in rejecting \(H_0\) is essentially zero.
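Equivalently, the 95% confidence interval for the slope excludes zero (a quick check; the bounds in the comment are computed from the estimates above):
confint(y)  # the QUET row should be roughly (14.25, 28.73), excluding 0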
We are 95% confident that the true mean SBP at the average observed QUET (\(\bar X = 3.441\)) is between \(140.989\) and \(148.074\):
coef = summary(y)$coefficients
coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.57640 12.321868 5.727736 2.990959e-06
## QUET 21.49167 3.545147 6.062279 1.171656e-06
ymean = coef[1,1] + coef[2,1] * 3.441094  # fitted mean at QUET = 3.441094 = mean(QUET)
ymean
## [1] 144.5313
summary(y)$sigma
## [1] 9.811597
length(QUET)
## [1] 32
s = summary(y)$sigma * sqrt(1/length(QUET))  # SE of the fitted mean at x = xbar
s
## [1] 1.734462
y$df.residual
## [1] 30
ymean + c(-1, 1) * qt(.975, df = y$df.residual) * s
## [1] 140.9890 148.0735
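The same interval can be obtained directly from predict() evaluated at the sample mean of QUET (a quick cross-check):
predict(y, newdata = data.frame(QUET = mean(QUET)), interval = "confidence")  # should reproduce 140.989 and 148.074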
newQUET = data.frame(QUET = seq(min(QUET), max(QUET), length = 100))
p1 = data.frame(predict(y, newdata = newQUET, interval = "confidence"))
p2 = data.frame(predict(y, newdata = newQUET, interval = "prediction"))
p1$Interval = "confidence"
p2$Interval = "prediction"
p1$x = newQUET$QUET
p2$x = newQUET$QUET
dat = rbind(p1, p2)
names(dat)[1] = "y"
ggplot(dat, aes(x = x, y = y)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = Interval), alpha = 0.2) +
geom_line() +
geom_point(data = data, aes(x = QUET, y = SBP), size = 4) +
theme_stata() +
scale_color_stata()
Yes: the scatter diagram and the fitted line suggest a positive linear association between QUET and SBP.
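The sample correlation quantifies this association; its square equals the model's R-squared (a quick check):
r = cor(QUET, SBP)
c(r = r, r_squared = r^2)  # r^2 should match Multiple R-squared = 0.5506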
Plotting of the residuals:
par(mfrow = c(3,2))
plot(y, which = 1:6)
par(mfrow = c(1,1))
Tests:
library(car)
library(lmtest)
# Normality of the errors: H_0: residuals come from normal population
shapiro.test(resid(y))
##
## Shapiro-Wilk normality test
##
## data: resid(y)
## W = 0.97006, p-value = 0.5011
# Independence of the errors: H_0: rho = 0
durbinWatsonTest(y)
## lag Autocorrelation D-W Statistic p-value
## 1 0.3113661 1.356743 0.088
## Alternative hypothesis: rho != 0
# Tests for homoscedasticity errors: H_0: equal variances
bptest(y)
##
## studentized Breusch-Pagan test
##
## data: y
## BP = 0.24972, df = 1, p-value = 0.6173
# Outliers:
outlierTest(y)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 9 2.816464 0.0086454 0.27665
dfbeta(y) # change in coefficients when the i^th point is deleted in the model
## (Intercept) QUET
## 1 0.80379843 -0.20798584
## 2 -2.23196724 0.47485745
## 3 -1.39309620 0.33626507
## 4 0.43070094 -0.15899593
## 5 2.89391559 -0.73095381
## 6 -0.54519315 0.14314306
## 7 -0.92523518 0.38774645
## 8 -0.55656321 0.27275428
## 9 14.13235405 -3.85692084
## 10 -6.31947756 1.94989034
## 11 -2.10974677 0.72951813
## 12 4.87898754 -1.60703812
## 13 2.10437218 -0.68181444
## 14 0.87327000 -0.36251360
## 15 0.17010043 -0.11640067
## 16 -0.24625685 0.06185327
## 17 0.15464852 -0.02419104
## 18 1.48661066 -0.37026585
## 19 -0.59279072 0.13698784
## 20 -0.08491781 0.00902303
## 21 -0.07937857 0.03674212
## 22 0.81207538 -0.30429459
## 23 -0.44053069 0.08653431
## 24 -1.06235085 0.23729682
## 25 0.72913278 -0.14158741
## 26 -0.80162023 0.20012941
## 27 -3.73345105 0.98038048
## 28 -2.15321136 0.54725946
## 29 -1.19554192 0.43095570
## 30 -3.27046782 1.05681595
## 31 0.80956287 -0.27152170
## 32 -1.75348537 0.58055748
dffits(y) # standardized change in the fitted value when the i^th point is deleted from the model
## 1 2 3 4 5 6
## 0.07639412 -0.38811946 -0.16462184 -0.07955174 0.30255556 -0.04967946
## 7 8 9 10 11 12
## 0.26320953 0.23532693 1.32665983 0.59701921 0.31264791 -0.62348198
## 13 14 15 16 17 18
## -0.23578136 -0.24053612 -0.13580106 -0.02555904 0.04106457 0.15949527
## 19 20 21 22 23 24
## -0.07880610 -0.03065351 0.02856414 -0.15920602 -0.08470319 -0.15570088
## 25 26 27 28 29 30
## 0.14409553 -0.08491904 -0.34727101 -0.21822483 0.20493954 0.36713110
## 31 32
## -0.10357553 0.21447115
At the 5% significance level, we have no significant evidence to reject the null hypotheses (\(H_0\)) of normality, independence, and homoscedasticity of the residuals: all p-values exceed the 5% previously set. The scatter diagrams do not suggest a problem with linearity. The residual plots and the outlier diagnostics do, however, point to a possible problem with extreme points (observation 9).
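To gauge how much observation 9 actually matters, we could refit the model without it and compare the coefficients (a sketch; the data columns are still attached at this point):
cooks.distance(y)[9]              # influence of observation 9 on the fit
y9 = lm(SBP ~ QUET, subset = -9)  # hypothetical refit excluding observation 9
rbind(full = coef(y), without_9 = coef(y9))  # compare intercept and slope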
detach(data)