1 In-class exercises 1:

Complete the regression analysis of the blood pressure data from the Framingham Heart Study in the R script . In there a gender difference in regression slopes?

1.1 data management

fL<-"https://math.montana.edu/shancock/data/Framingham.txt"

dta <- read.table(fL, header=T)

# recode gender variable
dta$sex <- factor(dta$sex, 
                  levels=c(1, 2), 
                  labels=c("M", "F")) 
head(dta)

  sex sbp dbp scl chdfate followup age  bmi month   id
1   M 120  80 267       1       18  55 25.0     8 2642
2   M 130  78 192       1       35  53 28.4    12 4627
3   M 144  90 207       1      109  61 25.1     8 2568
4   M  92  66 231       1      147  48 26.2    11 4192
5   M 162  98 271       1      169  39 28.4    11 3977
6   F 212 118 182       1      199  61 33.3     2  659

1.2 lattice plot

library(lattice)
xyplot(sbp ~ dbp | sex, 
       data=dta, cex=.5,
       type=c("p","g","r"),
       xlab="Diastolic pressure (mmHg)",
       ylab="Systolic pressure (mmHg)")

1.3 analysis

m0 <- lm(sbp ~ dbp, data=dta)
m0


Call:
lm(formula = sbp ~ dbp, data = dta)

Coefficients:
(Intercept)          dbp  
       16.9          1.4

m1 <- lm(sbp ~ dbp + sex, data=dta)
m1


Call:
lm(formula = sbp ~ dbp + sex, data = dta)

Coefficients:
(Intercept)          dbp         sexF  
      14.27         1.41         3.48

m2 <- lm(sbp ~ dbp + sex + sex:dbp, data=dta)
m2


Call:
lm(formula = sbp ~ dbp + sex + sex:dbp, data = dta)

Coefficients:
(Intercept)          dbp         sexF     dbp:sexF  
     29.854        1.225      -22.101        0.309

anova(m0, m1, m2)

Analysis of Variance Table

Model 1: sbp ~ dbp
Model 2: sbp ~ dbp + sex
Model 3: sbp ~ dbp + sex + sex:dbp
  Res.Df    RSS Df Sum of Sq    F Pr(>F)
1   4697 941778                         
2   4696 927853  1     13925 71.8 <2e-16
3   4695 910543  1     17310 89.3 <2e-16

A: Yes, the results revealed by linear model showed that the term of sexF is obviously different from m1 and m2.

1.4 residual plot

plot(rstandard(m2) ~ fitted(m2), 
     cex=.5, 
     xlab="Fitted values", 
     ylab="Standardized residuals")
grid()

2 In-class exercises 2:

The standard deviation of length estimates of each subject is available in the loudness data example. Perform a weighted regression and compare the results against an ordinary regression of length onto loudness.

2.1 regression with clusters

options(digits=5, show.signif.stars=FALSE)

# load the packages
# install.packages("pacman")
pacman::p_load(alr4, tidyverse, magrittr)

# make sure data set is active
data(Stevens, package="alr4")

# save a copy
dta <- Stevens %>% 
rename(Length=y, Loudness=loudness, Subject=subject)

2.2 Visualization

ot <- theme_set(theme_bw())

# 1 regression fits all
ggplot(dta, aes(Loudness, Length, group = Subject, alpha = Subject))+
 geom_point() +
 geom_line() +
 stat_smooth(aes(group = 1), method = "lm", size = rel(.8)) +
 coord_cartesian(ylim = c(-1, 5)) +
 labs(x = "Loudness (db)", y = "Average log(Length)") +
 theme(legend.position = "NONE")

2.3 one model fits all

m0 <- lm(Length ~ Loudness, data = dta)

# tidy summary model output
broom::tidy(m0)

# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -3.20     0.344       -9.31 2.53e-12
2 Loudness      0.0790   0.00482     16.4  2.69e-21

Note: Sometimes, the function broom::tidy for lm model can’t work. Please refer https://github.com/tidymodels/tidymodels.org/issues/147 In my suggestion, you can use broom::tidy(m0) to instead broom::tidy(summary(m0))

2.4 1 regression model per individual

ggplot(dta, aes(Loudness, Length, group = Subject, color = Subject))+
 geom_point() +
 stat_smooth(method = "lm", se = F, size = rel(.5)) +
 scale_color_grey()+
 coord_cartesian(ylim = c(-1, 5)) +
 labs(x = "Loudness (db)", y = "Average log(Length)") +
 theme(legend.position = "NONE")

2.5 fancy way of conducting 1 regression model per subject

# grouped by subject before doing regression
dta %>% 
 group_by(Subject) %>% 
 do(broom::tidy(lm(Length ~ Loudness, data = .)))

# A tibble: 20 x 6
# Groups:   Subject [10]
   Subject term        estimate std.error statistic  p.value
   <fct>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 A       (Intercept)  -4.49     0.748       -6.01 0.00923 
 2 A       Loudness      0.103    0.0105       9.80 0.00225 
 3 B       (Intercept)  -4.59     0.462       -9.92 0.00218 
 4 B       Loudness      0.0935   0.00647     14.5  0.000718
 5 C       (Intercept)  -2.29     0.253       -9.05 0.00285 
 6 C       Loudness      0.0695   0.00355     19.6  0.000291
 7 D       (Intercept)  -2.92     0.338       -8.63 0.00327 
 8 D       Loudness      0.0751   0.00474     15.8  0.000547
 9 E       (Intercept)  -5.39     0.743       -7.24 0.00542 
10 E       Loudness      0.104    0.0104       9.98 0.00214 
11 F       (Intercept)  -2.42     0.523       -4.63 0.0190  
12 F       Loudness      0.0774   0.00732     10.6  0.00181 
13 G       (Intercept)  -2.81     0.312       -9.01 0.00288 
14 G       Loudness      0.0650   0.00437     14.9  0.000659
15 H       (Intercept)  -3.42     0.401       -8.53 0.00338 
16 H       Loudness      0.0822   0.00561     14.6  0.000691
17 I       (Intercept)  -1.95     0.512       -3.80 0.0319  
18 I       Loudness      0.0656   0.00718      9.14 0.00277 
19 J       (Intercept)  -1.76     0.265       -6.63 0.00699 
20 J       Loudness      0.0552   0.00372     14.9  0.000660

# fitting a model treating subject as fixed effects
m1 <- lm(Length ~ Subject/Loudness - 1, data = dta)
summary(m1 <- lm(Length ~ Subject/Loudness - 1, data = dta))


Call:
lm(formula = Length ~ Subject/Loudness - 1, data = dta)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3318 -0.1073  0.0023  0.1485  0.3242 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
SubjectA          -4.49370    0.48664   -9.23  2.8e-10
SubjectB          -4.58530    0.48664   -9.42  1.8e-10
SubjectC          -2.29360    0.48664   -4.71  5.2e-05
SubjectD          -2.91940    0.48664   -6.00  1.4e-06
SubjectE          -5.38510    0.48664  -11.07  4.1e-12
SubjectF          -2.42120    0.48664   -4.98  2.5e-05
SubjectG          -2.81090    0.48664   -5.78  2.6e-06
SubjectH          -3.42070    0.48664   -7.03  8.2e-08
SubjectI          -1.94870    0.48664   -4.00  0.00038
SubjectJ          -1.75910    0.48664   -3.61  0.00109
SubjectA:Loudness  0.10265    0.00681   15.06  1.6e-15
SubjectB:Loudness  0.09355    0.00681   13.73  1.8e-14
SubjectC:Loudness  0.06950    0.00681   10.20  2.9e-11
SubjectD:Loudness  0.07506    0.00681   11.02  4.6e-12
SubjectE:Loudness  0.10391    0.00681   15.25  1.1e-15
SubjectF:Loudness  0.07740    0.00681   11.36  2.2e-12
SubjectG:Loudness  0.06497    0.00681    9.53  1.4e-10
SubjectH:Loudness  0.08223    0.00681   12.07  4.9e-13
SubjectI:Loudness  0.06561    0.00681    9.63  1.1e-10
SubjectJ:Loudness  0.05525    0.00681    8.11  4.7e-09

Residual standard error: 0.215 on 30 degrees of freedom
Multiple R-squared:  0.996, Adjusted R-squared:  0.993 
F-statistic:  369 on 20 and 30 DF,  p-value: <2e-16

2.6 augument data with model fit

dta <- mutate(dta, fit1 = fitted(m1))

2.7 Visualization

# 1 regression model for all with individual parameters 
ggplot(dta, aes(Loudness, fit1, group = Subject, alpha = Subject))+
 geom_path() +
 geom_point(aes(Loudness, Length, group = Subject, alpha = Subject)) +
 coord_cartesian(ylim = c(-1, 5)) +
 labs(x = "Loudness (dB)", y = "Average log (Length)") +
 theme(legend.position = "NONE")

## Weighted Regression Model

dta$ID <- rep(1:10, each = 5)
broom::tidy(m2 <- lm(Length ~ Loudness, weights=ID, data=dta))

# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -2.83     0.335       -8.46 4.49e-11
2 Loudness      0.0734   0.00469     15.7  1.74e-20

anova(m0, m1, m2)

Analysis of Variance Table

Model 1: Length ~ Loudness
Model 2: Length ~ Subject/Loudness - 1
Model 3: Length ~ Loudness
  Res.Df  RSS  Df Sum of Sq    F  Pr(>F)
1     48 11.2                           
2     30  1.4  18       9.8 11.7 5.1e-09
3     48 58.1 -18     -56.7 67.8 < 2e-16

The results between weighted and ordinary regression model is different (F = 67.8, p < .001). Note: I don’t know whether it is normal that Sun of Square is negative value.

3 In-class exercises 3:

The plot below describes the relationship between body weight (in kg) and age (in years) for a sample of 4,011 US girls from the data set USgirl{Brq}. Edit the R script to perform a quantile regression of weight onto age.

3.1 Data Manipulation

#
library(Brq)
#
data(USgirl, package="Brq")

3.2 Visualization

plot(USgirl, pch='.', bty="n",
     xlim=c(0, 25),
     ylim=c(0, 150),
     xlab="Age (in years)",
     ylab="Body weight (in kg)")

3.3 Add quantile regression line

ggplot(USgirl, aes(Age, Weight)) + 
 geom_point(alpha=.5, size=rel(.3)) + 
 geom_quantile(quantiles=1:9/10, 
               formula=y ~ x)+
 labs(x="Age (in years)", 
      y="Body weight (in kg)") +
 theme_minimal()

3.4 Quantile Regression for Age and Weight

library(quantreg)
m0_qr <- rq(Weight ~ Age, data=USgirl, tau=1:9/10)
summary(m0_qr)[1]

[[1]]

Call: rq(formula = Weight ~ Age, tau = 1:9/10, data = USgirl)

tau: [1] 0.1

Coefficients:
            Value     Std. Error t value   Pr(>|t|) 
(Intercept)   4.51227   0.12579   35.87230   0.00000
Age           2.24016   0.02023  110.71797   0.00000

The results revealed that the quantile of age is positively associted with weight.

plot(summary(m0_qr))

The changes for weight to age is larger older than for younger womens.

4 In-class exercises 4:

Create a markdown file to replicate the analysis reported in the box office data example.

4.1 Data Manipulation

dta <- read.table('./data/boxoffice.txt', header = T)
dta$Year_75 <- rep(1:32, 1)
names(dta) <- c("Box_Office", "year", "Year_75")
head(dta)

  Box_Office year Year_75
1       95.3 1976       1
2       86.4 1977       2
3      119.4 1978       3
4      124.4 1979       4
5      154.2 1980       5
6      174.3 1981       6

4.2 Visulization

ggplot(data=dta, aes(x=year, y=Box_Office)) +
 geom_point(pch=16) +
 geom_line(linetype=3) +
 labs(x="Year", 
      y="BoxOffice")+
 theme_minimal()

4.3 Regression model

summary(m0 <- lm(Box_Office ~ Year_75, data=dta))


Call:
lm(formula = Box_Office ~ Year_75, data = dta)

Residuals:
    Min      1Q  Median      3Q     Max 
-116.38  -79.20    6.08   62.26  121.70 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -55.93      28.03    -2.0    0.055
Year_75        29.53       1.48    19.9   <2e-16

Residual standard error: 77.4 on 30 degrees of freedom
Multiple R-squared:  0.93,  Adjusted R-squared:  0.927 
F-statistic:  397 on 1 and 30 DF,  p-value: <2e-16

library(languageR)
# acf.fnc(dta, 
#         group = NULL,
#         time="Year_75", 
#         x="Box_Office", 
#         plot=TRUE)

I don’t know why it can’t work on my laptop. (Error in [.data.frame(dat, , group) : undefined columns selected)

4.4 Generalized Least Squares

library(nlme)
summary(m1 <- gls(Box_Office ~ Year_75, data=dta, method="ML",
                                             correlation=corAR1(form= ~ Year_75)))$tTable

              Value Std.Error  t-value    p-value
(Intercept)  4.5141   72.7439 0.062054 9.5093e-01
Year_75     27.0754    3.4477 7.853259 9.1749e-09

4.5 Residual correlations

dta <- dta %>% 
  mutate(nres=resid(m1, 
                    type="normalized"))

# acf.fnc(dta, 
#         time="Year_75", 
#         x="nres", 
#         plot=TRUE)

lmtest::dwtest(nres ~ Year_75, 
               data=dta)


    Durbin-Watson test

data:  nres ~ Year_75
DW = 2.2, p-value = 0.65
alternative hypothesis: true autocorrelation is greater than 0

I don’t know why the results here is different from the example results.

5 In-class exercises 5:

The exercise solution to the cats data example is generated from the cats.Rmd markdown file.

5.1 Data

We load the ‘MASS’ package for the dataset named ‘cats’.

library(MASS)
data(cats, package="MASS")

The ‘cats’ data set is a data frame object in R. It have 144 rows and three columns. Each column is a variable. The first is a variable of factor type: F for female cats and M for male cats, the second and the third variables are of numeric type. They are body weight (in kilograms) and heart weight (in grams).

# structure of data 
str(cats)

'data.frame':   144 obs. of  3 variables:
 $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
 $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

Have a look at the first 6 lines of the data frame. We will use this data example to illustrate how to compare whether the slopes of two linear regressions are the same or not.

# as oppose to the tail end of the data
head(cats)

  Sex Bwt Hwt
1   F 2.0 7.0
2   F 2.0 7.4
3   F 2.0 9.5
4   F 2.1 7.2
5   F 2.1 7.3
6   F 2.1 7.6

5.2 Visualization

# load the built-in package
library(lattice)

We create a plot in the trellis graphical style. The left panel is for the relationship between heart weight and body weight of female cats, the right, that for male cats.

# conditional by gender - panels
xyplot(Hwt ~ Bwt | Sex, data=cats, type=c("p","g","r"),
       xlab="Body weight (kg)",
       ylab="Heart weight (g)")

It might be easier to compare slopes if they share the same frame. The impression is that the regression line for female cats (by comparison) has a larger intercept but a shallower slope, while that for male cats has a steeper slope but a smaller intercept.

# grouped by sex - in one panel
xyplot(Hwt ~ Bwt, groups=Sex, data=cats, type=c("p","g","r"),
       xlab="Body weight (Kilograms)",
       ylab="Heart weight (Grams)",
       auto.key=TRUE)

5.3 Linear regression models

First, we fit a simple linear model ignoring cats gender. In effect, we assume that cats regardless of gender share the same regression line.

summary(cats_1 <- lm(Hwt ~ Bwt, data=cats))


Call:
lm(formula = Hwt ~ Bwt, data = cats)

Residuals:
   Min     1Q Median     3Q    Max 
-3.569 -0.963 -0.092  1.043  5.124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.357      0.692   -0.52     0.61
Bwt            4.034      0.250   16.12   <2e-16

Residual standard error: 1.45 on 142 degrees of freedom
Multiple R-squared:  0.647, Adjusted R-squared:  0.644 
F-statistic:  260 on 1 and 142 DF,  p-value: <2e-16

The fitted model is

Hwt = -0.36 + 4.03 Bwt

Next we include cats gender into the regression model. The term ‘Sex’ is a factor variable having two levels. The default coding in R follows the alphanumeric rule by treating ‘F’ in (‘F’ and ‘M’ for Sex) as the reference level.

summary(cats_2 <- lm(Hwt ~ Bwt + Sex, data=cats))


Call:
lm(formula = Hwt ~ Bwt + Sex, data = cats)

Residuals:
   Min     1Q Median     3Q    Max 
-3.583 -0.970 -0.095  1.043  5.102 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.4150     0.7273   -0.57     0.57
Bwt           4.0758     0.2948   13.83   <2e-16
SexM         -0.0821     0.3040   -0.27     0.79

Residual standard error: 1.46 on 141 degrees of freedom
Multiple R-squared:  0.647, Adjusted R-squared:  0.642 
F-statistic:  129 on 2 and 141 DF,  p-value: <2e-16

The fitted model is

Hwt = -0.42 + 4.08 Bwt for female cats,

Hwt = (-0.42 - 0.08) + 4.08 Bwt for male cats

The above model assumes different intercepts for female and male cats but the same slope for both. To assume also different slopes between cats gender, we add the ‘interaction’ term ‘Sex:Bwt’ to the model.

# For convenience: lm(Hwt ~ Bwt*Sex, data=cats)
summary(cats_3 <- lm(Hwt ~ Bwt + Sex + Bwt:Sex, data=cats))


Call:
lm(formula = Hwt ~ Bwt + Sex + Bwt:Sex, data = cats)

Residuals:
   Min     1Q Median     3Q    Max 
-3.773 -1.012 -0.120  0.927  4.865 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.981      1.843    1.62  0.10796
Bwt            2.636      0.776    3.40  0.00088
SexM          -4.165      2.062   -2.02  0.04526
Bwt:SexM       1.676      0.837    2.00  0.04722

Residual standard error: 1.44 on 140 degrees of freedom
Multiple R-squared:  0.657, Adjusted R-squared:  0.649 
F-statistic: 89.2 on 3 and 140 DF,  p-value: <2e-16

The fitted model is

Hwt = 2.98 + 2.64 Bwt for female cats,

Hwt = (2.98 - 4.17) + (2.64+1.68) Bwt for male cats

The ‘interaction’ term ‘Sex:Bwt’ is statistically significant translates to that rate of change in heart weight in relation body weight for male cats is larger than that for female cats.

The above three models are nested models, i.e., the latter models can be simplified to former models by removing terms in them. We can compare these models by the F-test for the full-reduced model framework via the ‘anova’ command in R.

anova(cats_1, cats_2, cats_3)

Analysis of Variance Table

Model 1: Hwt ~ Bwt
Model 2: Hwt ~ Bwt + Sex
Model 3: Hwt ~ Bwt + Sex + Bwt:Sex
  Res.Df RSS Df Sum of Sq    F Pr(>F)
1    142 300                         
2    141 299  1      0.15 0.07  0.785
3    140 291  1      8.33 4.01  0.047

We see that model 2 is not different from model 1 but model 3 is different from model 2. The difference is in the gender by body weight term. The F-test gives the same p-value as in the output following the t-value of the model 3 summary. This is not a coincidence.

5.4 Conclusions

For the cats data example, the linear relationship between heart weight and body weight is positive and stronger for male cats than for female cats. For each additional increase in body weight by one kilogram, the increase in heart weight is about \(4.32 \pm 0.31\) grams for male cats and about \(2.64 \pm 0.78\) grams for female cats.

Note that the standard error for the slope coefficient estimate for male cats is smaller than that of the female cats. It can be obtained as follows:

show(vm <- vcov(cats_3)[c(2,4), c(2,4)])

              Bwt Bwt:SexM
Bwt       0.60202 -0.60202
Bwt:SexM -0.60202  0.70111

sqrt(vm[1,1]+vm[2,2]+2*vm[1,2])

[1] 0.31479

We can also get a rough estimate of this standard error by considering the data for male cats alone.

summary(lm(Hwt ~ Bwt, data=cats, subset=Sex=='M'))


Call:
lm(formula = Hwt ~ Bwt, data = cats, subset = Sex == "M")

Residuals:
   Min     1Q Median     3Q    Max 
-3.773 -1.048 -0.298  0.984  4.865 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.184      0.998   -1.19     0.24
Bwt            4.313      0.340   12.69   <2e-16

Residual standard error: 1.56 on 95 degrees of freedom
Multiple R-squared:  0.629, Adjusted R-squared:  0.625 
F-statistic:  161 on 1 and 95 DF,  p-value: <2e-16

5.5 Further questions

What is the difference between the slope estimate for male cats in Model 3 we have used earlier and that in the analysis of the male cats alone? A: The slope estimate for male cats is 3.13 in M3 and 4.31 in the analysis of the male cats alone. The difference between two models is -1.18, which equal to the slope estimates of intercept subtracted from SexM (2.98 - 4.17).
Why is that the standard error for the slope estimate of female cats is much larger than that for male cats? A: In my opinion, it related to larger sample size for male cats than female cats

Regression: In-class exercises

Hao-Lun Fu

2020-09-28

1 In-class exercises 1:

1.1 data management

1.2 lattice plot

1.3 analysis

1.4 residual plot

2 In-class exercises 2:

2.1 regression with clusters

2.2 Visualization

2.3 one model fits all

2.4 1 regression model per individual

2.5 fancy way of conducting 1 regression model per subject

2.6 augument data with model fit

2.7 Visualization

3 In-class exercises 3:

3.1 Data Manipulation

3.2 Visualization

3.3 Add quantile regression line

3.4 Quantile Regression for Age and Weight

4 In-class exercises 4:

4.1 Data Manipulation

4.2 Visulization

4.3 Regression model

4.4 Generalized Least Squares

4.5 Residual correlations

5 In-class exercises 5:

5.1 Data

5.2 Visualization

5.3 Linear regression models

5.4 Conclusions

5.5 Further questions