Complete the regression analysis of the blood pressure data from the Framingham Heart Study in the R script . In there a gender difference in regression slopes?
fL<-"https://math.montana.edu/shancock/data/Framingham.txt"
dta <- read.table(fL, header=T)
# recode gender variable
dta$sex <- factor(dta$sex,
levels=c(1, 2),
labels=c("M", "F"))
head(dta)
sex sbp dbp scl chdfate followup age bmi month id
1 M 120 80 267 1 18 55 25.0 8 2642
2 M 130 78 192 1 35 53 28.4 12 4627
3 M 144 90 207 1 109 61 25.1 8 2568
4 M 92 66 231 1 147 48 26.2 11 4192
5 M 162 98 271 1 169 39 28.4 11 3977
6 F 212 118 182 1 199 61 33.3 2 659
library(lattice)
xyplot(sbp ~ dbp | sex,
data=dta, cex=.5,
type=c("p","g","r"),
xlab="Diastolic pressure (mmHg)",
ylab="Systolic pressure (mmHg)")
m0 <- lm(sbp ~ dbp, data=dta)
m0
Call:
lm(formula = sbp ~ dbp, data = dta)
Coefficients:
(Intercept) dbp
16.9 1.4
m1 <- lm(sbp ~ dbp + sex, data=dta)
m1
Call:
lm(formula = sbp ~ dbp + sex, data = dta)
Coefficients:
(Intercept) dbp sexF
14.27 1.41 3.48
m2 <- lm(sbp ~ dbp + sex + sex:dbp, data=dta)
m2
Call:
lm(formula = sbp ~ dbp + sex + sex:dbp, data = dta)
Coefficients:
(Intercept) dbp sexF dbp:sexF
29.854 1.225 -22.101 0.309
anova(m0, m1, m2)
Analysis of Variance Table
Model 1: sbp ~ dbp
Model 2: sbp ~ dbp + sex
Model 3: sbp ~ dbp + sex + sex:dbp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 4697 941778
2 4696 927853 1 13925 71.8 <2e-16
3 4695 910543 1 17310 89.3 <2e-16
A: Yes, the results revealed by linear model showed that the term of sexF is obviously different from m1 and m2.
plot(rstandard(m2) ~ fitted(m2),
cex=.5,
xlab="Fitted values",
ylab="Standardized residuals")
grid()
The standard deviation of length estimates of each subject is available in the loudness data example. Perform a weighted regression and compare the results against an ordinary regression of length onto loudness.
options(digits=5, show.signif.stars=FALSE)
# load the packages
# install.packages("pacman")
pacman::p_load(alr4, tidyverse, magrittr)
# make sure data set is active
data(Stevens, package="alr4")
# save a copy
dta <- Stevens %>%
rename(Length=y, Loudness=loudness, Subject=subject)
ot <- theme_set(theme_bw())
# 1 regression fits all
ggplot(dta, aes(Loudness, Length, group = Subject, alpha = Subject))+
geom_point() +
geom_line() +
stat_smooth(aes(group = 1), method = "lm", size = rel(.8)) +
coord_cartesian(ylim = c(-1, 5)) +
labs(x = "Loudness (db)", y = "Average log(Length)") +
theme(legend.position = "NONE")
m0 <- lm(Length ~ Loudness, data = dta)
# tidy summary model output
broom::tidy(m0)
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.20 0.344 -9.31 2.53e-12
2 Loudness 0.0790 0.00482 16.4 2.69e-21
Note: Sometimes, the function broom::tidy for lm model can’t work. Please refer https://github.com/tidymodels/tidymodels.org/issues/147 In my suggestion, you can use broom::tidy(m0) to instead broom::tidy(summary(m0))
ggplot(dta, aes(Loudness, Length, group = Subject, color = Subject))+
geom_point() +
stat_smooth(method = "lm", se = F, size = rel(.5)) +
scale_color_grey()+
coord_cartesian(ylim = c(-1, 5)) +
labs(x = "Loudness (db)", y = "Average log(Length)") +
theme(legend.position = "NONE")
# grouped by subject before doing regression
dta %>%
group_by(Subject) %>%
do(broom::tidy(lm(Length ~ Loudness, data = .)))
# A tibble: 20 x 6
# Groups: Subject [10]
Subject term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 A (Intercept) -4.49 0.748 -6.01 0.00923
2 A Loudness 0.103 0.0105 9.80 0.00225
3 B (Intercept) -4.59 0.462 -9.92 0.00218
4 B Loudness 0.0935 0.00647 14.5 0.000718
5 C (Intercept) -2.29 0.253 -9.05 0.00285
6 C Loudness 0.0695 0.00355 19.6 0.000291
7 D (Intercept) -2.92 0.338 -8.63 0.00327
8 D Loudness 0.0751 0.00474 15.8 0.000547
9 E (Intercept) -5.39 0.743 -7.24 0.00542
10 E Loudness 0.104 0.0104 9.98 0.00214
11 F (Intercept) -2.42 0.523 -4.63 0.0190
12 F Loudness 0.0774 0.00732 10.6 0.00181
13 G (Intercept) -2.81 0.312 -9.01 0.00288
14 G Loudness 0.0650 0.00437 14.9 0.000659
15 H (Intercept) -3.42 0.401 -8.53 0.00338
16 H Loudness 0.0822 0.00561 14.6 0.000691
17 I (Intercept) -1.95 0.512 -3.80 0.0319
18 I Loudness 0.0656 0.00718 9.14 0.00277
19 J (Intercept) -1.76 0.265 -6.63 0.00699
20 J Loudness 0.0552 0.00372 14.9 0.000660
# fitting a model treating subject as fixed effects
m1 <- lm(Length ~ Subject/Loudness - 1, data = dta)
summary(m1 <- lm(Length ~ Subject/Loudness - 1, data = dta))
Call:
lm(formula = Length ~ Subject/Loudness - 1, data = dta)
Residuals:
Min 1Q Median 3Q Max
-0.3318 -0.1073 0.0023 0.1485 0.3242
Coefficients:
Estimate Std. Error t value Pr(>|t|)
SubjectA -4.49370 0.48664 -9.23 2.8e-10
SubjectB -4.58530 0.48664 -9.42 1.8e-10
SubjectC -2.29360 0.48664 -4.71 5.2e-05
SubjectD -2.91940 0.48664 -6.00 1.4e-06
SubjectE -5.38510 0.48664 -11.07 4.1e-12
SubjectF -2.42120 0.48664 -4.98 2.5e-05
SubjectG -2.81090 0.48664 -5.78 2.6e-06
SubjectH -3.42070 0.48664 -7.03 8.2e-08
SubjectI -1.94870 0.48664 -4.00 0.00038
SubjectJ -1.75910 0.48664 -3.61 0.00109
SubjectA:Loudness 0.10265 0.00681 15.06 1.6e-15
SubjectB:Loudness 0.09355 0.00681 13.73 1.8e-14
SubjectC:Loudness 0.06950 0.00681 10.20 2.9e-11
SubjectD:Loudness 0.07506 0.00681 11.02 4.6e-12
SubjectE:Loudness 0.10391 0.00681 15.25 1.1e-15
SubjectF:Loudness 0.07740 0.00681 11.36 2.2e-12
SubjectG:Loudness 0.06497 0.00681 9.53 1.4e-10
SubjectH:Loudness 0.08223 0.00681 12.07 4.9e-13
SubjectI:Loudness 0.06561 0.00681 9.63 1.1e-10
SubjectJ:Loudness 0.05525 0.00681 8.11 4.7e-09
Residual standard error: 0.215 on 30 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.993
F-statistic: 369 on 20 and 30 DF, p-value: <2e-16
dta <- mutate(dta, fit1 = fitted(m1))
# 1 regression model for all with individual parameters
ggplot(dta, aes(Loudness, fit1, group = Subject, alpha = Subject))+
geom_path() +
geom_point(aes(Loudness, Length, group = Subject, alpha = Subject)) +
coord_cartesian(ylim = c(-1, 5)) +
labs(x = "Loudness (dB)", y = "Average log (Length)") +
theme(legend.position = "NONE")
## Weighted Regression Model
dta$ID <- rep(1:10, each = 5)
broom::tidy(m2 <- lm(Length ~ Loudness, weights=ID, data=dta))
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.83 0.335 -8.46 4.49e-11
2 Loudness 0.0734 0.00469 15.7 1.74e-20
anova(m0, m1, m2)
Analysis of Variance Table
Model 1: Length ~ Loudness
Model 2: Length ~ Subject/Loudness - 1
Model 3: Length ~ Loudness
Res.Df RSS Df Sum of Sq F Pr(>F)
1 48 11.2
2 30 1.4 18 9.8 11.7 5.1e-09
3 48 58.1 -18 -56.7 67.8 < 2e-16
The results between weighted and ordinary regression model is different (F = 67.8, p < .001). Note: I don’t know whether it is normal that Sun of Square is negative value.
The plot below describes the relationship between body weight (in kg) and age (in years) for a sample of 4,011 US girls from the data set USgirl{Brq}. Edit the R script to perform a quantile regression of weight onto age.
#
library(Brq)
#
data(USgirl, package="Brq")
plot(USgirl, pch='.', bty="n",
xlim=c(0, 25),
ylim=c(0, 150),
xlab="Age (in years)",
ylab="Body weight (in kg)")
ggplot(USgirl, aes(Age, Weight)) +
geom_point(alpha=.5, size=rel(.3)) +
geom_quantile(quantiles=1:9/10,
formula=y ~ x)+
labs(x="Age (in years)",
y="Body weight (in kg)") +
theme_minimal()
library(quantreg)
m0_qr <- rq(Weight ~ Age, data=USgirl, tau=1:9/10)
summary(m0_qr)[1]
[[1]]
Call: rq(formula = Weight ~ Age, tau = 1:9/10, data = USgirl)
tau: [1] 0.1
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 4.51227 0.12579 35.87230 0.00000
Age 2.24016 0.02023 110.71797 0.00000
The results revealed that the quantile of age is positively associted with weight.
plot(summary(m0_qr))
The changes for weight to age is larger older than for younger womens.
Create a markdown file to replicate the analysis reported in the box office data example.
dta <- read.table('./data/boxoffice.txt', header = T)
dta$Year_75 <- rep(1:32, 1)
names(dta) <- c("Box_Office", "year", "Year_75")
head(dta)
Box_Office year Year_75
1 95.3 1976 1
2 86.4 1977 2
3 119.4 1978 3
4 124.4 1979 4
5 154.2 1980 5
6 174.3 1981 6
ggplot(data=dta, aes(x=year, y=Box_Office)) +
geom_point(pch=16) +
geom_line(linetype=3) +
labs(x="Year",
y="BoxOffice")+
theme_minimal()
summary(m0 <- lm(Box_Office ~ Year_75, data=dta))
Call:
lm(formula = Box_Office ~ Year_75, data = dta)
Residuals:
Min 1Q Median 3Q Max
-116.38 -79.20 6.08 62.26 121.70
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -55.93 28.03 -2.0 0.055
Year_75 29.53 1.48 19.9 <2e-16
Residual standard error: 77.4 on 30 degrees of freedom
Multiple R-squared: 0.93, Adjusted R-squared: 0.927
F-statistic: 397 on 1 and 30 DF, p-value: <2e-16
library(languageR)
# acf.fnc(dta,
# group = NULL,
# time="Year_75",
# x="Box_Office",
# plot=TRUE)
I don’t know why it can’t work on my laptop. (Error in [.data.frame(dat, , group) : undefined columns selected)
library(nlme)
summary(m1 <- gls(Box_Office ~ Year_75, data=dta, method="ML",
correlation=corAR1(form= ~ Year_75)))$tTable
Value Std.Error t-value p-value
(Intercept) 4.5141 72.7439 0.062054 9.5093e-01
Year_75 27.0754 3.4477 7.853259 9.1749e-09
dta <- dta %>%
mutate(nres=resid(m1,
type="normalized"))
# acf.fnc(dta,
# time="Year_75",
# x="nres",
# plot=TRUE)
lmtest::dwtest(nres ~ Year_75,
data=dta)
Durbin-Watson test
data: nres ~ Year_75
DW = 2.2, p-value = 0.65
alternative hypothesis: true autocorrelation is greater than 0
I don’t know why the results here is different from the example results.
The exercise solution to the cats data example is generated from the cats.Rmd markdown file.
We load the ‘MASS’ package for the dataset named ‘cats’.
library(MASS)
data(cats, package="MASS")
The ‘cats’ data set is a data frame object in R. It have 144 rows and three columns. Each column is a variable. The first is a variable of factor type: F for female cats and M for male cats, the second and the third variables are of numeric type. They are body weight (in kilograms) and heart weight (in grams).
# structure of data
str(cats)
'data.frame': 144 obs. of 3 variables:
$ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
$ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
Have a look at the first 6 lines of the data frame. We will use this data example to illustrate how to compare whether the slopes of two linear regressions are the same or not.
# as oppose to the tail end of the data
head(cats)
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
5 F 2.1 7.3
6 F 2.1 7.6
# load the built-in package
library(lattice)
We create a plot in the trellis graphical style. The left panel is for the relationship between heart weight and body weight of female cats, the right, that for male cats.
# conditional by gender - panels
xyplot(Hwt ~ Bwt | Sex, data=cats, type=c("p","g","r"),
xlab="Body weight (kg)",
ylab="Heart weight (g)")
It might be easier to compare slopes if they share the same frame. The impression is that the regression line for female cats (by comparison) has a larger intercept but a shallower slope, while that for male cats has a steeper slope but a smaller intercept.
# grouped by sex - in one panel
xyplot(Hwt ~ Bwt, groups=Sex, data=cats, type=c("p","g","r"),
xlab="Body weight (Kilograms)",
ylab="Heart weight (Grams)",
auto.key=TRUE)
First, we fit a simple linear model ignoring cats gender. In effect, we assume that cats regardless of gender share the same regression line.
summary(cats_1 <- lm(Hwt ~ Bwt, data=cats))
Call:
lm(formula = Hwt ~ Bwt, data = cats)
Residuals:
Min 1Q Median 3Q Max
-3.569 -0.963 -0.092 1.043 5.124
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.357 0.692 -0.52 0.61
Bwt 4.034 0.250 16.12 <2e-16
Residual standard error: 1.45 on 142 degrees of freedom
Multiple R-squared: 0.647, Adjusted R-squared: 0.644
F-statistic: 260 on 1 and 142 DF, p-value: <2e-16
The fitted model is
Hwt = -0.36 + 4.03 Bwt
Next we include cats gender into the regression model. The term ‘Sex’ is a factor variable having two levels. The default coding in R follows the alphanumeric rule by treating ‘F’ in (‘F’ and ‘M’ for Sex) as the reference level.
summary(cats_2 <- lm(Hwt ~ Bwt + Sex, data=cats))
Call:
lm(formula = Hwt ~ Bwt + Sex, data = cats)
Residuals:
Min 1Q Median 3Q Max
-3.583 -0.970 -0.095 1.043 5.102
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4150 0.7273 -0.57 0.57
Bwt 4.0758 0.2948 13.83 <2e-16
SexM -0.0821 0.3040 -0.27 0.79
Residual standard error: 1.46 on 141 degrees of freedom
Multiple R-squared: 0.647, Adjusted R-squared: 0.642
F-statistic: 129 on 2 and 141 DF, p-value: <2e-16
The fitted model is
Hwt = -0.42 + 4.08 Bwt for female cats,
Hwt = (-0.42 - 0.08) + 4.08 Bwt for male cats
The above model assumes different intercepts for female and male cats but the same slope for both. To assume also different slopes between cats gender, we add the ‘interaction’ term ‘Sex:Bwt’ to the model.
# For convenience: lm(Hwt ~ Bwt*Sex, data=cats)
summary(cats_3 <- lm(Hwt ~ Bwt + Sex + Bwt:Sex, data=cats))
Call:
lm(formula = Hwt ~ Bwt + Sex + Bwt:Sex, data = cats)
Residuals:
Min 1Q Median 3Q Max
-3.773 -1.012 -0.120 0.927 4.865
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.981 1.843 1.62 0.10796
Bwt 2.636 0.776 3.40 0.00088
SexM -4.165 2.062 -2.02 0.04526
Bwt:SexM 1.676 0.837 2.00 0.04722
Residual standard error: 1.44 on 140 degrees of freedom
Multiple R-squared: 0.657, Adjusted R-squared: 0.649
F-statistic: 89.2 on 3 and 140 DF, p-value: <2e-16
The fitted model is
Hwt = 2.98 + 2.64 Bwt for female cats,
Hwt = (2.98 - 4.17) + (2.64+1.68) Bwt for male cats
The ‘interaction’ term ‘Sex:Bwt’ is statistically significant translates to that rate of change in heart weight in relation body weight for male cats is larger than that for female cats.
The above three models are nested models, i.e., the latter models can be simplified to former models by removing terms in them. We can compare these models by the F-test for the full-reduced model framework via the ‘anova’ command in R.
anova(cats_1, cats_2, cats_3)
Analysis of Variance Table
Model 1: Hwt ~ Bwt
Model 2: Hwt ~ Bwt + Sex
Model 3: Hwt ~ Bwt + Sex + Bwt:Sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 142 300
2 141 299 1 0.15 0.07 0.785
3 140 291 1 8.33 4.01 0.047
We see that model 2 is not different from model 1 but model 3 is different from model 2. The difference is in the gender by body weight term. The F-test gives the same p-value as in the output following the t-value of the model 3 summary. This is not a coincidence.
For the cats data example, the linear relationship between heart weight and body weight is positive and stronger for male cats than for female cats. For each additional increase in body weight by one kilogram, the increase in heart weight is about \(4.32 \pm 0.31\) grams for male cats and about \(2.64 \pm 0.78\) grams for female cats.
Note that the standard error for the slope coefficient estimate for male cats is smaller than that of the female cats. It can be obtained as follows:
show(vm <- vcov(cats_3)[c(2,4), c(2,4)])
Bwt Bwt:SexM
Bwt 0.60202 -0.60202
Bwt:SexM -0.60202 0.70111
sqrt(vm[1,1]+vm[2,2]+2*vm[1,2])
[1] 0.31479
We can also get a rough estimate of this standard error by considering the data for male cats alone.
summary(lm(Hwt ~ Bwt, data=cats, subset=Sex=='M'))
Call:
lm(formula = Hwt ~ Bwt, data = cats, subset = Sex == "M")
Residuals:
Min 1Q Median 3Q Max
-3.773 -1.048 -0.298 0.984 4.865
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.184 0.998 -1.19 0.24
Bwt 4.313 0.340 12.69 <2e-16
Residual standard error: 1.56 on 95 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.625
F-statistic: 161 on 1 and 95 DF, p-value: <2e-16
What is the difference between the slope estimate for male cats in Model 3 we have used earlier and that in the analysis of the male cats alone? A: The slope estimate for male cats is 3.13 in M3 and 4.31 in the analysis of the male cats alone. The difference between two models is -1.18, which equal to the slope estimates of intercept subtracted from SexM (2.98 - 4.17).
Why is that the standard error for the slope estimate of female cats is much larger than that for male cats? A: In my opinion, it related to larger sample size for male cats than female cats