COD_week13_1_MGK_BTE3207

Minsik Kim

2023-11-20

Agenda

This lecture will practice running

  1. simple linear models

  2. multiple linear regression

  3. Effect modification

  4. ANOVA

Before begin..

Let’s load datasets.

the SBP dataset

dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")



head(dataset_sbp)
##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 116  78  94   4 16.6
## 2   1     1 100  60  79   4 22.3
## 3   1     1 100  60  87   4 21.9
## 4   1     1 111  70  72   4 20.2
## 5   1     1 120  80  98   4 20.0
## 6   1     1 115  79  95   4 23.1

Data manipulation

Making a new variable hypertension

dataset_sbp$hypertension <- ifelse(dataset_sbp$SBP > 130 |
                                           dataset_sbp$DBP > 80,
                                   T,
                                   F)


dataset_sbp$history_of_hypertension <- ifelse(dataset_sbp$DIS == 1 |
                                           dataset_sbp$DIS == 2,
                                   T,
                                   F)

dataset_sbp$DIS <- factor(dataset_sbp$DIS, 
                          levels = c(1,2,3,4),
                          labels = c("Hypertension and diabetes",
                                     "Hypertension",
                                     "Diabetes",
                                     "No history"))

set.seed(1)

dataset_sbp_small <- subset(dataset_sbp, row.names(dataset_sbp) %in% sample(x = 1:1000000, size = 1000))

UCLA school dataset

https://stats.idre.ucla.edu/

Dataset is students from 10 handpicked schools, representing a subset of students and schools from a US survey of eight-grade students at 1000 schools (800 public 200 private).

There are quite a lot of variables in the dataset, but we are going start by using just three:


 # Variable Type Len Pos Label
-----------------------------------------------------------------------------------------------
15 cstr     Num    8 112
5 homework Num    8  32 Time spent on math homework each week
11 math     Num    8  80 Math score
4 meanses  Num    8  24 Mean SES for the school
7 parented Num    8  48 Parents highest education level

10 percmin  Num    8  72 Percent minority in school
 8 public   Num    8  56 Public school: 1=public, 0=non-public
13 race     Num    8  96 race of student, 1=asian, 2=Hispanic, 3=Black, 4=White, 5=Native American
 9 ratio    Num    8  64 Student-Teacher ratio
18 region   Num    8 136
 1 schid    Num    8   0 School ID
19 schnum   Num    8 144 group(schid)
16 scsize   Num    8 120
14 sctype   Num    8 104 Type of school, 1=public, 2=catholic, 3=Private
                         other religious, 4=Private non-r
 3 ses      Num    8  16 Socioecnonomic Status
12 sex      Num    8  88 Sex: 1=male, 2=female
 2 stuid    Num    8   8 Student ID
17 urban    Num    8 128
 6 white    Num    8  40 Race: 1=white, 0=non-white

Here,

homework: number of hours of homework - Level 1 variable (associated with students) math: score in a math test - expanatory variable schnum: a number for each school - Level 2 variable (associated with school) scsize: school size - Level 2 variable public: school type - Level 2 variable

library(haven)

imm10_data <- read_dta("https://stats.idre.ucla.edu/stat/examples/imm/imm10.dta") 

Before analyzing..

You need to double-check if the regression is appropriate or not.

Simple linear regression

Step 1

You need to store the linear model first.

sbp_dbp_fit <- lm(data = dataset_sbp_small, SBP ~ DBP)

You can run ANOVA on the anlaysis result.

sbp_dbp_fit %>% anova
## Analysis of Variance Table
## 
## Response: SBP
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## DBP         1 125712  125712  1396.7 < 2.2e-16 ***
## Residuals 998  89828      90                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the stored data, you can extract multiple summarized informaiton.

sbp_dbp_fit %>% names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

You can choose the model using

sbp_dbp_fit$call
## lm(formula = SBP ~ DBP, data = dataset_sbp_small)

To laod the beta values,

sbp_dbp_fit$coefficients
## (Intercept)         DBP 
##   36.685904    1.116707

Choosing the numeric values in the lm() output can be conducted using double-brackets

sbp_dbp_fit$coefficients[[1]]
## [1] 36.6859
sbp_dbp_fit$coefficients[[2]]
## [1] 1.116707

…etc.

Step 2

You can generate a plot with linear regression

plot(SBP ~ DBP, data = dataset_sbp_small) +
abline(sbp_dbp_fit, col = "red") +
title("DBP vs SBP, 1000 person")

## integer(0)

Well, it seems like the data fits well…. but does the model is actually representing the association vetween SBP vs DBP or just the data?

Wroing example

Let’s assuem that our data is somehow having more data at the lower-end

hw_math_fit <- lm(data = imm10_data, math ~ homework)

plot(math ~ homework, data = imm10_data) +
abline(hw_math_fit, col = "red") +
title("Homework vs Math")

## integer(0)
imm10_data %>% 
        mutate(urban = as.factor(urban),
               public = as.factor(public)) %>%
        ggplot2::ggplot(aes(y = math, x = homework, col = urban,
                       shape = public)) +
        geom_point() 

Do you think the data is normally distributed?

Double-checking the distribution of data

SBP dataset

hist(dataset_sbp_small$SBP)

DBP data

hist(dataset_sbp_small$DBP)

Homework data of urban schools

imm10_data %>% filter(urban == 1) %>% .$homework %>% hist

Urban schools had right skewed data.

Do we need to generate these histograms, every time, for every variables that we are using?

Using plot() function

Using plot funcion you can explore data at once

imm10_data %>% plot

Using GGrally package

An improved version of those associations, can be double-checked using the below package

library(GGally)

ggpairs(imm10_data)

Adding normal curve to histogram

How can we compare them with normal distribution?

We can add normal curve to the data

hist(dataset_sbp_small$SBP, freq = F)
curve(dnorm(x,
            mean= mean(dataset_sbp_small$SBP), 
            sd=sd(dataset_sbp_small$SBP)), 
      col="darkblue", lwd=2, add=TRUE)

SBP was quite normal.

hist(dataset_sbp_small$DBP, freq = F)
curve(dnorm(x,
            mean= mean(dataset_sbp_small$DBP), 
            sd=sd(dataset_sbp_small$DBP)), 
      col="darkblue", lwd=2, add=TRUE)

DBP was somwhat normal..

hist(imm10_data$homework, freq = F, breaks = 9)
curve(dnorm(x,
            mean= mean(imm10_data$homework), 
            sd=sd(imm10_data$homework)), 
      col="darkblue", lwd=2, add=TRUE)

Not normal at all

Deciding data is normal

We can generate a quantile of normal distribution and that of data! It woll show the different between normal curve and samples in more intuitive way.

Here, the x-axis is the quantile of a normal distribution of having the same mean and SD with dataset.

qqnorm(rnorm(n = 10000), pch = 1, frame = FALSE)
qqline(rnorm(n = 10000),
       col = "red", lwd = 2)

Q-Q plot of normal data

Q-Q plot of SBP dataset

With SBP data, it looks like this.

qqnorm(dataset_sbp_small$SBP, pch = 1, frame = FALSE)
qqline(dataset_sbp_small$SBP,
       col = "red", lwd = 2)

Here, the x-axis is the quantile of a normal distribution of having the same mean and SD with SBP dataset.

Q-Q plot of DBP dataset

With DBP data, it looks like this.

qqnorm(dataset_sbp_small$DBP, pch = 1, frame = FALSE)
qqline(dataset_sbp_small$DBP,
       col = "red", lwd = 2)

DBP is less normal than SBP. The error seems to high at higher quantile values.

Q-Q plot of homework

qqnorm(imm10_data$homework, pch = 1, frame = FALSE)
qqline(imm10_data$homework,
       col = "red", lwd = 2)

## Q-Q of math test results

qqnorm(imm10_data$math, pch = 1, frame = FALSE)
qqline(imm10_data$math,
       col = "red", lwd = 2)

Obviously, the data showed unreasonable outcome.

Does it mean that our data is useless?

Evaluation after running lme

We can also plot the results of the linear model.

par(mfrow=c(2,2))
plot(sbp_dbp_fit)

par(mfrow=c(1,1))

Here,

  1. residual vs fillted shows the linearity between two variables

i.e., it investigates weather the residual is uniform across all the independent variable.

  1. Q-Q plot shows the normality of both data

This plots the qunatile of residuals. The residuals alsho has to have normal distribution. The theoretical residual can be calculated by sqrt(MSE) * z((k-0.375/(n+0.25))). Where the MSE is mean squares, k is the rank, n is the total sample number, and the z is tne normal distribtion’s pnorm().

  1. Standardized version of plot 1 (variable omn the same scale).

  2. residual v leverage plot. This calculates which one is outlier (based on assumption that the data is normally distributed).We only have 3 outliars for SBP data.

Doing the same thing for imm-dataset,

par(mfrow=c(2,2))
plot(hw_math_fit)

par(mfrow=c(1,1))

Although the distribution of data was not normal, the residuals were normal. So we can use the dataset.

Multiple linear regression

Now we are going to run multiple linear regression using R

sbp_dbp_fit_mlr <- lm(data = dataset_sbp_small, SBP ~ DBP + BMI + SEX + DIS)

Running anova

sbp_dbp_fit_mlr %>% anova
## Analysis of Variance Table
## 
## Response: SBP
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## DBP         1 125712  125712 1535.552 < 2.2e-16 ***
## BMI         1   1536    1536   18.756 1.636e-05 ***
## SEX         1    138     138    1.683    0.1948    
## DIS         3   6860    2287   27.933 < 2.2e-16 ***
## Residuals 993  81294      82                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gender is irrelevant to our analysis - for now I am going to remove it from the analysis

view summary

sbp_dbp_fit_mlr %>% summary
## 
## Call:
## lm(formula = SBP ~ DBP + BMI + SEX + DIS, data = dataset_sbp_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.479  -5.195  -0.890   5.838  39.023 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     44.31409    3.45760  12.816  < 2e-16 ***
## DBP              1.01381    0.03070  33.023  < 2e-16 ***
## BMI              0.29627    0.08928   3.318 0.000939 ***
## SEX             -0.89221    0.58841  -1.516 0.129762    
## DISHypertension -0.88760    1.39876  -0.635 0.525860    
## DISDiabetes     -3.52533    1.89557  -1.860 0.063213 .  
## DISNo history   -7.13347    1.27451  -5.597 2.82e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.048 on 993 degrees of freedom
## Multiple R-squared:  0.6228, Adjusted R-squared:  0.6206 
## F-statistic: 273.3 on 6 and 993 DF,  p-value: < 2.2e-16

Can run the same thing with mlr

par(mfrow=c(2,2))
plot(sbp_dbp_fit_mlr)

par(mfrow=c(1,1))

To make adjusted plot, You can choose the data as below.

sbp_dbp_fit_mlr$coefficients["(Intercept)"][[1]]
## [1] 44.31409
sbp_dbp_fit_mlr$coefficients["DBP"][[1]]
## [1] 1.013805

Using the adjusted coefficients of the linear model, you can draw linear model.

plot(data = dataset_sbp_small, SBP~DBP,
     main = "Simple linear model"
     ) +
        abline(a = sbp_dbp_fit$coefficients["(Intercept)"][[1]],
               b = sbp_dbp_fit$coefficients["DBP"][[1]], 
               col = "blue",
               linewidth = 2) 

## integer(0)
plot(data = dataset_sbp_small, SBP~DBP,
     main = "BMI, DIS adjusted linear model\n(of BMI = 0, DIS group 1)"
     ) +
        abline(a = sbp_dbp_fit_mlr$coefficients["(Intercept)"][[1]],
               b = sbp_dbp_fit_mlr$coefficients["DBP"][[1]], 
               col = "red",
               linewidth = 2) 

## integer(0)

Effect modificaitons

Interaction terms

Interaction term can be added easily

sbp_dbp_fit_int <- lm(data = dataset_sbp_small, SBP ~ DBP + BMI + SEX * DIS)

Running anova

sbp_dbp_fit_int %>% anova
## Analysis of Variance Table
## 
## Response: SBP
##            Df Sum Sq Mean Sq   F value    Pr(>F)    
## DBP         1 125712  125712 1553.4602 < 2.2e-16 ***
## BMI         1   1536    1536   18.9749 1.462e-05 ***
## SEX         1    138     138    1.7027  0.192244    
## DIS         3   6860    2287   28.2586 < 2.2e-16 ***
## SEX:DIS     3   1180     393    4.8603  0.002318 ** 
## Residuals 990  80114      81                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gender is irrelevant to our analysis - for now I am going to remove it from the analysis

view summary

sbp_dbp_fit_int %>% summary
## 
## Call:
## lm(formula = SBP ~ DBP + BMI + SEX * DIS, data = dataset_sbp_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.374  -5.505  -1.075   5.805  40.956 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          46.69544    4.71991   9.893  < 2e-16 ***
## DBP                   1.00687    0.03058  32.929  < 2e-16 ***
## BMI                   0.25843    0.08950   2.887  0.00397 ** 
## SEX                  -1.53479    2.51128  -0.611  0.54123    
## DISHypertension      -7.91562    4.22418  -1.874  0.06124 .  
## DISDiabetes         -11.25986    5.70926  -1.972  0.04886 *  
## DISNo history        -6.27973    3.76541  -1.668  0.09568 .  
## SEX:DISHypertension   4.74479    2.86857   1.654  0.09843 .  
## SEX:DISDiabetes       5.37686    3.83921   1.401  0.16167    
## SEX:DISNo history    -0.58478    2.60166  -0.225  0.82220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.996 on 990 degrees of freedom
## Multiple R-squared:  0.6283, Adjusted R-squared:  0.6249 
## F-statistic: 185.9 on 9 and 990 DF,  p-value: < 2.2e-16

Can run the same thing with mlr

par(mfrow=c(2,2))
plot(sbp_dbp_fit_mlr)

par(mfrow=c(1,1))

To make adjusted plot, You can choose the data as below.

sbp_dbp_fit_mlr$coefficients["(Intercept)"][[1]]
## [1] 44.31409
sbp_dbp_fit_mlr$coefficients["DBP"][[1]]
## [1] 1.013805

Using the adjusted coefficients of the linear model, you can draw linear model.

In ggplot package, geom_smooth(method = “lm”) will generate linear regression of each data groups at once!

ggplot2::ggplot(dataset_sbp_small, aes(y = SBP, x = DBP, col = DIS)) +
        geom_point() +
        geom_smooth(method = "lm")

Other example would be…

ggplot2::ggplot(imm10_data, aes(y = math, x = homework, col = as.factor(schnum))) +
        geom_point() +
        geom_smooth(method = "lm")

Stratification

Manucal

Data can be splited using following filter() function

data_dis_1 <- dataset_sbp_small %>% filter(as.numeric(dataset_sbp_small$DIS) == 1)
data_dis_2 <- dataset_sbp_small %>% filter(as.numeric(dataset_sbp_small$DIS) == 2)
data_dis_3 <- dataset_sbp_small %>% filter(as.numeric(dataset_sbp_small$DIS) == 3)
data_dis_4 <- dataset_sbp_small %>% filter(as.numeric(dataset_sbp_small$DIS) == 4)
data_dis_1 %>%
        lm(SBP ~ DBP, data = .) %>% summary
## 
## Call:
## lm(formula = SBP ~ DBP, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.834  -9.548  -1.263   7.380  22.451 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.2702    12.9268   4.895 9.30e-06 ***
## DBP           0.8571     0.1577   5.433 1.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.47 on 54 degrees of freedom
## Multiple R-squared:  0.3534, Adjusted R-squared:  0.3415 
## F-statistic: 29.52 on 1 and 54 DF,  p-value: 1.358e-06
data_dis_2 %>%
        lm(SBP ~ DBP, data = .) %>% summary
## 
## Call:
## lm(formula = SBP ~ DBP, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.463  -9.492  -0.478   8.515  37.552 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  58.3608     6.8509   8.519 9.35e-15 ***
## DBP           0.9015     0.0847  10.643  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.03 on 166 degrees of freedom
## Multiple R-squared:  0.4056, Adjusted R-squared:  0.402 
## F-statistic: 113.3 on 1 and 166 DF,  p-value: < 2.2e-16
data_dis_3 %>%
        lm(SBP ~ DBP, data = .) %>% summary
## 
## Call:
## lm(formula = SBP ~ DBP, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.435  -7.306   1.113   7.581  21.532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.5014    15.6245   3.680 0.000738 ***
## DBP           0.8709     0.2061   4.227 0.000149 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.17 on 37 degrees of freedom
## Multiple R-squared:  0.3256, Adjusted R-squared:  0.3074 
## F-statistic: 17.86 on 1 and 37 DF,  p-value: 0.0001489
data_dis_4 %>%
        lm(SBP ~ DBP, data = .) %>% summary
## 
## Call:
## lm(formula = SBP ~ DBP, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.984  -4.388  -1.090   5.612  30.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.44194    2.37374   15.35   <2e-16 ***
## DBP          1.09932    0.03165   34.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.395 on 735 degrees of freedom
## Multiple R-squared:  0.6214, Adjusted R-squared:  0.6209 
## F-statistic:  1206 on 1 and 735 DF,  p-value: < 2.2e-16

QQplot……..skipping some

data_dis_1 %>%
        lm(SBP ~ DBP, data = .) %>%
        plot()

Plotting.. You can do that one by one. Else, the blow code can be used.

ggplot2::ggplot(dataset_sbp_small, aes(y = SBP, x = DBP, col = DIS)) +
        geom_point() +
        geom_smooth(method = "lm") +
        facet_wrap(~DIS)

Linaer mixed effect models

math_hwk_lmer <- lme4::lmer(data = imm10_data, math ~ homework + (1|schid))
math_hwk_lmer %>% anova
## Analysis of Variance Table
##          npar Sum Sq Mean Sq F value
## homework    1 2187.5  2187.5  33.903

This one does not result in p-values!

math_hwk_lmer <- lmerTest::lmer(data = imm10_data, math ~ homework + (1|schid))
math_hwk_lmer %>% anova
## Type III Analysis of Variance Table with Satterthwaite's method
##          Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)    
## homework 2187.5  2187.5     1 257.15  33.903 1.721e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lmerTest package need to be used instead.

plot(math_hwk_lmer)

To calculate QQ plot of residuals,

qqnorm(resid(math_hwk_lmer), main = "Q-Q residuals", ylab = "Standardized residuals")
qqline(resid(math_hwk_lmer))

as the lmer output (merMod object) is not a list. (They are S4 objects)

Bibliography

## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.