Agenda
This lecture will practice running
simple linear models
multiple linear regression
Effect modification
ANOVA
Before begin..
Let’s load datasets.
the SBP dataset
dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")
head(dataset_sbp)
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 116 78 94 4 16.6
## 2 1 1 100 60 79 4 22.3
## 3 1 1 100 60 87 4 21.9
## 4 1 1 111 70 72 4 20.2
## 5 1 1 120 80 98 4 20.0
## 6 1 1 115 79 95 4 23.1
Data manipulation
Making a new variable hypertension
dataset_sbp$hypertension <- ifelse(dataset_sbp$SBP > 130 |
dataset_sbp$DBP > 80,
T,
F)
dataset_sbp$history_of_hypertension <- ifelse(dataset_sbp$DIS == 1 |
dataset_sbp$DIS == 2,
T,
F)
dataset_sbp$DIS <- factor(dataset_sbp$DIS,
levels = c(1,2,3,4),
labels = c("Hypertension and diabetes",
"Hypertension",
"Diabetes",
"No history"))
set.seed(1)
dataset_sbp_small <- subset(dataset_sbp, row.names(dataset_sbp) %in% sample(x = 1:1000000, size = 1000))
UCLA school dataset
Dataset is students from 10 handpicked schools, representing a subset of students and schools from a US survey of eight-grade students at 1000 schools (800 public 200 private).
There are quite a lot of variables in the dataset, but we are going start by using just three:
# Variable Type Len Pos Label
-----------------------------------------------------------------------------------------------
15 cstr Num 8 112
5 homework Num 8 32 Time spent on math homework each week
11 math Num 8 80 Math score
4 meanses Num 8 24 Mean SES for the school
7 parented Num 8 48 Parents highest education level
10 percmin Num 8 72 Percent minority in school
8 public Num 8 56 Public school: 1=public, 0=non-public
13 race Num 8 96 race of student, 1=asian, 2=Hispanic, 3=Black, 4=White, 5=Native American
9 ratio Num 8 64 Student-Teacher ratio
18 region Num 8 136
1 schid Num 8 0 School ID
19 schnum Num 8 144 group(schid)
16 scsize Num 8 120
14 sctype Num 8 104 Type of school, 1=public, 2=catholic, 3=Private
other religious, 4=Private non-r
3 ses Num 8 16 Socioecnonomic Status
12 sex Num 8 88 Sex: 1=male, 2=female
2 stuid Num 8 8 Student ID
17 urban Num 8 128
6 white Num 8 40 Race: 1=white, 0=non-white
Here,
homework: number of hours of homework - Level 1 variable (associated with students) math: score in a math test - expanatory variable schnum: a number for each school - Level 2 variable (associated with school) scsize: school size - Level 2 variable public: school type - Level 2 variable
library(haven)
imm10_data <- read_dta("https://stats.idre.ucla.edu/stat/examples/imm/imm10.dta")
Before analyzing..
You need to double-check if the regression is appropriate or not.
Simple linear regression
Step 1
You need to store the linear model first.
sbp_dbp_fit <- lm(data = dataset_sbp_small, SBP ~ DBP)
You can run ANOVA on the anlaysis result.
sbp_dbp_fit %>% anova
## Analysis of Variance Table
##
## Response: SBP
## Df Sum Sq Mean Sq F value Pr(>F)
## DBP 1 125712 125712 1396.7 < 2.2e-16 ***
## Residuals 998 89828 90
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the stored data, you can extract multiple summarized informaiton.
sbp_dbp_fit %>% names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
You can choose the model using
sbp_dbp_fit$call
## lm(formula = SBP ~ DBP, data = dataset_sbp_small)
To laod the beta values,
sbp_dbp_fit$coefficients
## (Intercept) DBP
## 36.685904 1.116707
Choosing the numeric values in the lm() output can be conducted using double-brackets
sbp_dbp_fit$coefficients[[1]]
## [1] 36.6859
sbp_dbp_fit$coefficients[[2]]
## [1] 1.116707
…etc.
Step 2
You can generate a plot with linear regression
plot(SBP ~ DBP, data = dataset_sbp_small) +
abline(sbp_dbp_fit, col = "red") +
title("DBP vs SBP, 1000 person")
## integer(0)
Well, it seems like the data fits well…. but does the model is actually representing the association vetween SBP vs DBP or just the data?
Wroing example
Let’s assuem that our data is somehow having more data at the lower-end
hw_math_fit <- lm(data = imm10_data, math ~ homework)
plot(math ~ homework, data = imm10_data) +
abline(hw_math_fit, col = "red") +
title("Homework vs Math")
## integer(0)
imm10_data %>%
mutate(urban = as.factor(urban),
public = as.factor(public)) %>%
ggplot2::ggplot(aes(y = math, x = homework, col = urban,
shape = public)) +
geom_point()
Do you think the data is normally distributed?
Double-checking the distribution of data
SBP dataset
hist(dataset_sbp_small$SBP)
DBP data
hist(dataset_sbp_small$DBP)
Homework data of urban schools
imm10_data %>% filter(urban == 1) %>% .$homework %>% hist
Urban schools had right skewed data.
Do we need to generate these histograms, every time, for every variables that we are using?
Using plot() function
Using plot funcion you can explore data at once
imm10_data %>% plot
Using GGrally package
An improved version of those associations, can be double-checked using the below package
library(GGally)
ggpairs(imm10_data)
Adding normal curve to histogram
How can we compare them with normal distribution?
We can add normal curve to the data
hist(dataset_sbp_small$SBP, freq = F)
curve(dnorm(x,
mean= mean(dataset_sbp_small$SBP),
sd=sd(dataset_sbp_small$SBP)),
col="darkblue", lwd=2, add=TRUE)
SBP was quite normal.
hist(dataset_sbp_small$DBP, freq = F)
curve(dnorm(x,
mean= mean(dataset_sbp_small$DBP),
sd=sd(dataset_sbp_small$DBP)),
col="darkblue", lwd=2, add=TRUE)
DBP was somwhat normal..
hist(imm10_data$homework, freq = F, breaks = 9)
curve(dnorm(x,
mean= mean(imm10_data$homework),
sd=sd(imm10_data$homework)),
col="darkblue", lwd=2, add=TRUE)
Not normal at all
Deciding data is normal
We can generate a quantile of normal distribution and that of data! It woll show the different between normal curve and samples in more intuitive way.
Here, the x-axis is the quantile of a normal distribution of having the same mean and SD with dataset.
qqnorm(rnorm(n = 10000), pch = 1, frame = FALSE)
qqline(rnorm(n = 10000),
col = "red", lwd = 2)
Q-Q plot of normal data
Q-Q plot of SBP dataset
With SBP data, it looks like this.
qqnorm(dataset_sbp_small$SBP, pch = 1, frame = FALSE)
qqline(dataset_sbp_small$SBP,
col = "red", lwd = 2)
Here, the x-axis is the quantile of a normal distribution of having the
same mean and SD with SBP dataset.
Q-Q plot of DBP dataset
With DBP data, it looks like this.
qqnorm(dataset_sbp_small$DBP, pch = 1, frame = FALSE)
qqline(dataset_sbp_small$DBP,
col = "red", lwd = 2)
DBP is less normal than SBP. The error seems to high at higher quantile
values.
Q-Q plot of homework
qqnorm(imm10_data$homework, pch = 1, frame = FALSE)
qqline(imm10_data$homework,
col = "red", lwd = 2)
## Q-Q of math test results
qqnorm(imm10_data$math, pch = 1, frame = FALSE)
qqline(imm10_data$math,
col = "red", lwd = 2)
Obviously, the data showed unreasonable outcome.
Does it mean that our data is useless?
Evaluation after running lme
We can also plot the results of the linear model.
par(mfrow=c(2,2))
plot(sbp_dbp_fit)
par(mfrow=c(1,1))
Here,
- residual vs fillted shows the linearity between two variables
i.e., it investigates weather the residual is uniform across all the independent variable.
- Q-Q plot shows the normality of both data
This plots the qunatile of residuals. The residuals alsho has to have normal distribution. The theoretical residual can be calculated by sqrt(MSE) * z((k-0.375/(n+0.25))). Where the MSE is mean squares, k is the rank, n is the total sample number, and the z is tne normal distribtion’s pnorm().
Standardized version of plot 1 (variable omn the same scale).
residual v leverage plot. This calculates which one is outlier (based on assumption that the data is normally distributed).We only have 3 outliars for SBP data.
Doing the same thing for imm-dataset,
par(mfrow=c(2,2))
plot(hw_math_fit)
par(mfrow=c(1,1))
Although the distribution of data was not normal, the residuals were normal. So we can use the dataset.
Multiple linear regression
Now we are going to run multiple linear regression using R
sbp_dbp_fit_mlr <- lm(data = dataset_sbp_small, SBP ~ DBP + BMI + SEX + DIS)
Running anova
sbp_dbp_fit_mlr %>% anova
## Analysis of Variance Table
##
## Response: SBP
## Df Sum Sq Mean Sq F value Pr(>F)
## DBP 1 125712 125712 1535.552 < 2.2e-16 ***
## BMI 1 1536 1536 18.756 1.636e-05 ***
## SEX 1 138 138 1.683 0.1948
## DIS 3 6860 2287 27.933 < 2.2e-16 ***
## Residuals 993 81294 82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Gender is irrelevant to our analysis - for now I am going to remove it from the analysis
view summary
sbp_dbp_fit_mlr %>% summary
##
## Call:
## lm(formula = SBP ~ DBP + BMI + SEX + DIS, data = dataset_sbp_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.479 -5.195 -0.890 5.838 39.023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.31409 3.45760 12.816 < 2e-16 ***
## DBP 1.01381 0.03070 33.023 < 2e-16 ***
## BMI 0.29627 0.08928 3.318 0.000939 ***
## SEX -0.89221 0.58841 -1.516 0.129762
## DISHypertension -0.88760 1.39876 -0.635 0.525860
## DISDiabetes -3.52533 1.89557 -1.860 0.063213 .
## DISNo history -7.13347 1.27451 -5.597 2.82e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.048 on 993 degrees of freedom
## Multiple R-squared: 0.6228, Adjusted R-squared: 0.6206
## F-statistic: 273.3 on 6 and 993 DF, p-value: < 2.2e-16
Can run the same thing with mlr
par(mfrow=c(2,2))
plot(sbp_dbp_fit_mlr)
par(mfrow=c(1,1))
To make adjusted plot
, You can choose the data as
below.
sbp_dbp_fit_mlr$coefficients["(Intercept)"][[1]]
## [1] 44.31409
sbp_dbp_fit_mlr$coefficients["DBP"][[1]]
## [1] 1.013805
Using the adjusted coefficients of the linear model, you can draw linear model.
plot(data = dataset_sbp_small, SBP~DBP,
main = "Simple linear model"
) +
abline(a = sbp_dbp_fit$coefficients["(Intercept)"][[1]],
b = sbp_dbp_fit$coefficients["DBP"][[1]],
col = "blue",
linewidth = 2)
## integer(0)
plot(data = dataset_sbp_small, SBP~DBP,
main = "BMI, DIS adjusted linear model\n(of BMI = 0, DIS group 1)"
) +
abline(a = sbp_dbp_fit_mlr$coefficients["(Intercept)"][[1]],
b = sbp_dbp_fit_mlr$coefficients["DBP"][[1]],
col = "red",
linewidth = 2)
## integer(0)
Effect modificaitons
Interaction terms
Interaction term can be added easily
sbp_dbp_fit_int <- lm(data = dataset_sbp_small, SBP ~ DBP + BMI + SEX * DIS)
Running anova
sbp_dbp_fit_int %>% anova
## Analysis of Variance Table
##
## Response: SBP
## Df Sum Sq Mean Sq F value Pr(>F)
## DBP 1 125712 125712 1553.4602 < 2.2e-16 ***
## BMI 1 1536 1536 18.9749 1.462e-05 ***
## SEX 1 138 138 1.7027 0.192244
## DIS 3 6860 2287 28.2586 < 2.2e-16 ***
## SEX:DIS 3 1180 393 4.8603 0.002318 **
## Residuals 990 80114 81
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Gender is irrelevant to our analysis - for now I am going to remove it from the analysis
view summary
sbp_dbp_fit_int %>% summary
##
## Call:
## lm(formula = SBP ~ DBP + BMI + SEX * DIS, data = dataset_sbp_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.374 -5.505 -1.075 5.805 40.956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.69544 4.71991 9.893 < 2e-16 ***
## DBP 1.00687 0.03058 32.929 < 2e-16 ***
## BMI 0.25843 0.08950 2.887 0.00397 **
## SEX -1.53479 2.51128 -0.611 0.54123
## DISHypertension -7.91562 4.22418 -1.874 0.06124 .
## DISDiabetes -11.25986 5.70926 -1.972 0.04886 *
## DISNo history -6.27973 3.76541 -1.668 0.09568 .
## SEX:DISHypertension 4.74479 2.86857 1.654 0.09843 .
## SEX:DISDiabetes 5.37686 3.83921 1.401 0.16167
## SEX:DISNo history -0.58478 2.60166 -0.225 0.82220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.996 on 990 degrees of freedom
## Multiple R-squared: 0.6283, Adjusted R-squared: 0.6249
## F-statistic: 185.9 on 9 and 990 DF, p-value: < 2.2e-16
Can run the same thing with mlr
par(mfrow=c(2,2))
plot(sbp_dbp_fit_mlr)
par(mfrow=c(1,1))
To make adjusted plot
, You can choose the data as
below.
sbp_dbp_fit_mlr$coefficients["(Intercept)"][[1]]
## [1] 44.31409
sbp_dbp_fit_mlr$coefficients["DBP"][[1]]
## [1] 1.013805
Using the adjusted coefficients of the linear model, you can draw linear model.
In ggplot package, geom_smooth(method = “lm”) will generate linear regression of each data groups at once!
ggplot2::ggplot(dataset_sbp_small, aes(y = SBP, x = DBP, col = DIS)) +
geom_point() +
geom_smooth(method = "lm")
Other example would be…
ggplot2::ggplot(imm10_data, aes(y = math, x = homework, col = as.factor(schnum))) +
geom_point() +
geom_smooth(method = "lm")
Stratification
Manucal
Data can be splited using following filter() function
data_dis_1 <- dataset_sbp_small %>% filter(as.numeric(dataset_sbp_small$DIS) == 1)
data_dis_2 <- dataset_sbp_small %>% filter(as.numeric(dataset_sbp_small$DIS) == 2)
data_dis_3 <- dataset_sbp_small %>% filter(as.numeric(dataset_sbp_small$DIS) == 3)
data_dis_4 <- dataset_sbp_small %>% filter(as.numeric(dataset_sbp_small$DIS) == 4)
data_dis_1 %>%
lm(SBP ~ DBP, data = .) %>% summary
##
## Call:
## lm(formula = SBP ~ DBP, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.834 -9.548 -1.263 7.380 22.451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.2702 12.9268 4.895 9.30e-06 ***
## DBP 0.8571 0.1577 5.433 1.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.47 on 54 degrees of freedom
## Multiple R-squared: 0.3534, Adjusted R-squared: 0.3415
## F-statistic: 29.52 on 1 and 54 DF, p-value: 1.358e-06
data_dis_2 %>%
lm(SBP ~ DBP, data = .) %>% summary
##
## Call:
## lm(formula = SBP ~ DBP, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.463 -9.492 -0.478 8.515 37.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.3608 6.8509 8.519 9.35e-15 ***
## DBP 0.9015 0.0847 10.643 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.03 on 166 degrees of freedom
## Multiple R-squared: 0.4056, Adjusted R-squared: 0.402
## F-statistic: 113.3 on 1 and 166 DF, p-value: < 2.2e-16
data_dis_3 %>%
lm(SBP ~ DBP, data = .) %>% summary
##
## Call:
## lm(formula = SBP ~ DBP, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.435 -7.306 1.113 7.581 21.532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.5014 15.6245 3.680 0.000738 ***
## DBP 0.8709 0.2061 4.227 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.17 on 37 degrees of freedom
## Multiple R-squared: 0.3256, Adjusted R-squared: 0.3074
## F-statistic: 17.86 on 1 and 37 DF, p-value: 0.0001489
data_dis_4 %>%
lm(SBP ~ DBP, data = .) %>% summary
##
## Call:
## lm(formula = SBP ~ DBP, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.984 -4.388 -1.090 5.612 30.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.44194 2.37374 15.35 <2e-16 ***
## DBP 1.09932 0.03165 34.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.395 on 735 degrees of freedom
## Multiple R-squared: 0.6214, Adjusted R-squared: 0.6209
## F-statistic: 1206 on 1 and 735 DF, p-value: < 2.2e-16
QQplot……..skipping some
data_dis_1 %>%
lm(SBP ~ DBP, data = .) %>%
plot()
Plotting.. You can do that one by one. Else, the blow code can be used.
ggplot2::ggplot(dataset_sbp_small, aes(y = SBP, x = DBP, col = DIS)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DIS)
Linaer mixed effect models
math_hwk_lmer <- lme4::lmer(data = imm10_data, math ~ homework + (1|schid))
math_hwk_lmer %>% anova
## Analysis of Variance Table
## npar Sum Sq Mean Sq F value
## homework 1 2187.5 2187.5 33.903
This one does not result in p-values!
math_hwk_lmer <- lmerTest::lmer(data = imm10_data, math ~ homework + (1|schid))
math_hwk_lmer %>% anova
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## homework 2187.5 2187.5 1 257.15 33.903 1.721e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmerTest
package need to be used instead.
plot(math_hwk_lmer)
To calculate QQ plot of residuals,
qqnorm(resid(math_hwk_lmer), main = "Q-Q residuals", ylab = "Standardized residuals")
qqline(resid(math_hwk_lmer))
as the lmer output (merMod
object) is not a
list
. (They are S4 objects)
Bibliography
## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.