This article is a part of FETP biostatistic course. So I can not provide the dataset used in this article.
#Setting up environment with useful packages
library(tidyverse)
library(haven)
We can use regression model for many purposes like predict outcome from given predictors, explore relationship of many given factors, test some hypothesis, adjust variable for data analysis.This article will provide some examples for do these job with R.
Key assumptions when fitting a linear regression model are
1.Linearity between outcome and predictors
2.The outcome variable is normally distributed across predictor values
3.The variance of outcomes should be the same across predictor values
First let try simple linear regression model fit.
Aceclofenac<-read_dta("Pharmacokinetics_Aceclofenac_100.dta")
model1<-lm(peakarea~drugconc,data = Aceclofenac)
summary(model1)
##
## Call:
## lm(formula = peakarea ~ drugconc, data = Aceclofenac)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1637.16 -95.89 157.11 309.11 1186.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2543.04 280.46 9.067 1.02e-06 ***
## drugconc 7601.54 15.17 500.952 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 826.6 on 12 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 0.9999
## F-statistic: 2.51e+05 on 1 and 12 DF, p-value: < 2.2e-16
##Check regression model assumptions.
par(mfrow=c(2,2))
plot(model1)
par(mfrow=c(1,1))
hist(model1$residuals,
main = "Histogram of model1 residuals",
xlab = "Residuals",
col = "pink")
As plots show, this model residuals are not normal distribution. So
This is not a good regression model because it not met the linear
regression assumptions.
In best practice we should check the the assumptions number 1 and 2
before fitting a linear regression model and check assumption number 3
after fitted model by examine model residuals distribution.
plot(Aceclofenac$peakarea~Aceclofenac$drugconc,
main = "Scatter plot of predicter and outcome",
xlab = "Drug concentration",
ylab = "Peak area")
hist(Aceclofenac$peakarea,
main = "Histogram of outcome",
xlab = "Peak area",
col = "cornflowerblue")
As above plots show, for model1, it violate assumption number 2, the
outcome is not normally distributed. Now let try another linear
regression and talk about model usage.
1.Comparison
lbw <- read_dta("lowbwt_dat.dta")
head(lbw)
## # A tibble: 6 × 13
## ID low age lwt race smoke ptl ht ui ftv bwt
## <dbl> <dbl+lbl> <dbl> <dbl> <dbl+l> <dbl+l> <dbl> <dbl+> <dbl+l> <dbl> <dbl>
## 1 85 0 [birth w… 19 182 2 [bla… 0 [no] 0 0 [no] 1 [yes] 0 2523
## 2 86 0 [birth w… 33 155 3 [oth… 0 [no] 0 0 [no] 0 [no] 3 2551
## 3 87 0 [birth w… 20 105 1 [whi… 1 [yes] 0 0 [no] 0 [no] 1 2557
## 4 88 0 [birth w… 21 108 1 [whi… 1 [yes] 0 0 [no] 1 [yes] 2 2594
## 5 89 0 [birth w… 18 107 1 [whi… 1 [yes] 0 0 [no] 1 [yes] 0 2600
## 6 91 0 [birth w… 21 124 3 [oth… 0 [no] 0 0 [no] 0 [no] 0 2622
## # … with 2 more variables: `_Irace_2` <dbl>, `_Irace_3` <dbl>
##Say that we want to examine the difference between baby weight (bwt) among 2 groups of baby's mother who are smoker and not smoker.
#We can use t-test for comparison.
t.test(lbw$bwt~lbw$smoke,var.equal = TRUE)
##
## Two Sample t-test
##
## data: lbw$bwt by lbw$smoke
## t = 2.6336, df = 187, p-value = 0.009156
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 70.69274 492.73382
## sample estimates:
## mean in group 0 mean in group 1
## 3054.957 2773.243
#set var.equal = TRUE for equal variances comparison.
According to p value 0.00916 mean that the difference of baby weight
whose
mother is a smoker and not a smoker is not statistical
significant.
We can also use linear regression model for comparison between groups
too.
model2<- lm(data = lbw,bwt~smoke)
summary(model2)
##
## Call:
## lm(formula = bwt ~ smoke, data = lbw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2064.24 -477.24 35.04 545.04 1935.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3054.96 66.93 45.642 < 2e-16 ***
## smoke -281.71 106.97 -2.634 0.00916 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 717.8 on 187 degrees of freedom
## Multiple R-squared: 0.03576, Adjusted R-squared: 0.03061
## F-statistic: 6.936 on 1 and 187 DF, p-value: 0.009156
The model2 also show p-value of smoke coefficients 0.00916 same as t-test.
2.Use linear regression for adjust variables.
model3<- lm(bwt~smoke+lwt,data = lbw)
summary(model3)
##
## Call:
## lm(formula = bwt ~ smoke + lwt, data = lbw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2030.16 -447.00 27.74 514.20 1968.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2500.174 230.833 10.831 <2e-16 ***
## smoke -270.013 105.590 -2.557 0.0114 *
## lwt 4.238 1.690 2.508 0.0130 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 707.8 on 186 degrees of freedom
## Multiple R-squared: 0.06731, Adjusted R-squared: 0.05728
## F-statistic: 6.711 on 2 and 186 DF, p-value: 0.001533
We can use multivariable linear regression model for adjusting
variables of interest. We can fit model more than one variable by use +
in lm function. Model3 is a linear regression model with 2 variable
mother’s weight (lwt) and mother’s smoking status (smoke), look at
coefficients we can interpret that
1.If mother smoke the baby weight will decrease by 270 grams compare to
mother who not smoke with the same age. 2.If mother’s weight increase 1
gram,the baby weight will increase 4.238 grams adjusted to smoke
status.
model4 <-lm(bwt~age+as.factor(race), data = lbw)
summary(model4)
##
## Call:
## lm(formula = bwt ~ age + as.factor(race), data = lbw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2129.72 -489.38 7.96 523.20 1758.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2953.505 255.216 11.573 <2e-16 ***
## age 6.185 10.067 0.614 0.5398
## as.factor(race)2 -367.020 160.550 -2.286 0.0234 *
## as.factor(race)3 -287.952 115.470 -2.494 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 715.3 on 185 degrees of freedom
## Multiple R-squared: 0.05268, Adjusted R-squared: 0.03732
## F-statistic: 3.429 on 3 and 185 DF, p-value: 0.01824
###race is numeric,which is a misinterpretation.
##Race should be transformed to factor before fitting into a model.
#White =1, Black=2, Asian=3
#So White is reference group.
We can interpret model 4 that after adjusted age effect 1.The baby who born with black race mother will has 367 grams lower weight compare to white race mother. 2.The baby who born with asian mother will has 287 grams lower weight compare to white race mother. 3.After adjusted race effect, the one year increase in mother age will increase baby weight by 6.185 grams.
3.Using linear regression for exploratory data
analysis.
In this purpose we can use linear regression model to examine which
predictors have positive or negative relation with outcome, which
predictors are significant predictors.
vietnam_kap <- read_dta("Vietnam_KAP_ Covid-19.dta")
head(vietnam_kap)
## # A tibble: 6 × 60
## No Code A1Gender Sex A2Age A3Educationlevel Education Region A4Area
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 VP01 Female 0 34 University & higher 2 Northern Vinhp…
## 2 2 VP02 Female 0 34 University & higher 2 Northern Vinhp…
## 3 3 VP03 Female 0 37 University & higher 2 Northern Vinhp…
## 4 4 VP04 Female 0 40 College 0 Northern Vinhp…
## 5 5 VP05 Female 0 34 University & higher 2 Northern Vinhp…
## 6 6 VP06 Female 0 37 University & higher 2 Northern Vinhp…
## # … with 51 more variables: Province <dbl+lbl>,
## # A5Workingexperienceindrugst <dbl>, B1ThevirusthatcausesCOVID <chr>,
## # B2Peoplecanbecomeinfectedb <chr>, B3Directcontactwiththebloo <chr>,
## # B4CommonsymptomsofCOVID19 <chr>, B5CommonsymptomsofCOVID19 <chr>,
## # B6CommonsymptomsofCOVID19 <chr>, B7Difficultybreathingandsho <chr>,
## # B8Sorethroatcanbeasymptom <chr>, B9Nauseaandvomittingcanbe <chr>,
## # B10Headachecanbeasymptomo <chr>, B11Bellyacheanddiarrheacan <chr>, …
##What variables relate to knowledge mark in this dataset.
#We can use multiple regression to answer this
model5<- lm(data = vietnam_kap, Knowledgemark~ A2Age+
as.factor(Education)+
as.factor(Province))
summary(model5)
##
## Call:
## lm(formula = Knowledgemark ~ A2Age + as.factor(Education) + as.factor(Province),
## data = vietnam_kap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3410 -0.5762 0.0696 0.7676 3.8122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.846476 0.309463 35.049 < 2e-16 ***
## A2Age 0.018274 0.005213 3.506 0.000476 ***
## as.factor(Education)1 -0.940544 0.119978 -7.839 1.14e-14 ***
## as.factor(Education)2 0.019517 0.113356 0.172 0.863336
## as.factor(Province)2 0.429371 0.284564 1.509 0.131644
## as.factor(Province)3 0.865883 0.275662 3.141 0.001732 **
## as.factor(Province)4 1.052352 0.274071 3.840 0.000131 ***
## as.factor(Province)5 0.514330 0.317508 1.620 0.105567
## as.factor(Province)6 1.007949 0.370463 2.721 0.006625 **
## as.factor(Province)7 2.035274 0.268416 7.583 7.66e-14 ***
## as.factor(Province)8 -0.090228 0.265923 -0.339 0.734452
## as.factor(Province)9 0.526401 0.266351 1.976 0.048387 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.4 on 1011 degrees of freedom
## Multiple R-squared: 0.2763, Adjusted R-squared: 0.2685
## F-statistic: 35.1 on 11 and 1011 DF, p-value: < 2.2e-16
Interpretation : According to p value the factors that statistical significant effect the knowledgemark are age, educated from middle school(Education2),live in province 3, 4, 6, 7 and 9. Educated from middle school and live in province 8 have a negative relationship with knowledgemark.
4.Using linear regression for hypothesis
testing.
If our predictor of interest is a category data its slope significant
level can tell us about the significant diference between group, which
we can use to test hypothesis. Say if we want to know is there a
different in total score for each teaching strategy. We can set null
hypothesis to “There is no difference in total score for each teaching
strategy group” and then fit linear model and look at teaching strategy
coefficient.
blend_learn <- read_dta("blended_learning_pharmacy_education.dta")
head(blend_learn,10)
## # A tibble: 10 × 26
## College learningstrategy group age gender sex pre post casestudy
## <dbl> <chr> <dbl+lbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9 bl 1 [experme… 22 f 0 36 43 30
## 2 9 bl 1 [experme… 22 f 0 30 46 30
## 3 9 bl 1 [experme… 22 f 0 25 50 34
## 4 9 bl 1 [experme… 22 m 1 32 38 14
## 5 9 bl 1 [experme… 22 m 1 24 49 34
## 6 9 bl 1 [experme… 22 f 0 42 49 34
## 7 9 bl 1 [experme… 23 f 0 25 43 32
## 8 9 bl 1 [experme… 22 f 0 20 50 34
## 9 9 bl 1 [experme… 23 f 0 36 50 34
## 10 9 bl 1 [experme… 23 f 0 27 50 24
## # … with 17 more variables: postpre <dbl>, total <dbl>, Intrin <dbl>,
## # EXTRINSIC <dbl>, TASK <dbl>, CONTROL <dbl>, SELFEFFFI <dbl>, ANXIETY <dbl>,
## # REHERSA <dbl>, ELABORA <dbl>, ORGANIZ <dbl>, CRITICAL <dbl>, METACO <dbl>,
## # TIME <dbl>, EFFORT <dbl>, PEER <dbl>, HELP <dbl>
model6<- lm(data = blend_learn, total~as.factor(group))
summary(model6)
##
## Call:
## lm(formula = total ~ as.factor(group), data = blend_learn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.113 -8.261 2.345 11.345 31.887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.261 1.741 36.908 < 2e-16 ***
## as.factor(group)2 -12.148 2.744 -4.427 1.46e-05 ***
## as.factor(group)3 -7.606 2.497 -3.045 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.7 on 238 degrees of freedom
## Multiple R-squared: 0.08132, Adjusted R-squared: 0.0736
## F-statistic: 10.53 on 2 and 238 DF, p-value: 4.134e-05
The teaching group coefficients show p value less that 0.05 we can
reject null hypothesis and conclude that there is statistical
significant different in total score for each teaching stategy group.And
we can do post hoc analysis if we like.
Another example.
treatment <- read_dta("pneumonia_treatment.dta")
head(treatment)
## # A tibble: 6 × 39
## ID age Sexo hospitalstay Stay_from_Start… Outcome Medicine Comorbilidad
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl+l> <dbl+lb> <dbl>
## 1 1 (80,9… 0 4 3 1 [ali… 1 [Meth… 1
## 2 2 (60,7… 1 4 2 1 [ali… 1 [Meth… 1
## 3 3 (80,9… 1 4 2 1 [ali… 1 [Meth… 1
## 4 4 (80,9… 1 7 4 1 [ali… 1 [Meth… 1
## 5 5 (90,+] 1 4 3 1 [ali… 1 [Meth… 1
## 6 6 (60,7… 1 6 5 1 [ali… 1 [Meth… 1
## # … with 31 more variables: Cancer <chr>, Hypothyroidism <chr>,
## # Heart_enf <chr>, Falla_Renal <chr>, Demencia <chr>, Obesity <chr>,
## # admisison_date <chr>, follow_30_d <chr>, status_inicial <chr>, Epoc <chr>,
## # Diabetes_mellitus <chr>, HTA <chr>, other_antibiotic <chr>,
## # azithromycin <chr>, Received_antibiotic <chr>, MTP_start_date <chr>,
## # received_colchicine <dbl>, pafi_inicial <dbl>, DHL_inicial <dbl>,
## # Ferritina_inicial <dbl>, Dimero_D_inicial <dbl>, PCR_inicial <dbl>, …
unique(treatment$Medicine)
## <labelled<double>[2]>: Medicine
## [1] 1 0
##
## Labels:
## value label
## 0 Dexamethasone
## 1 Methylprednisolone high dose
## Suppose we want to know is there a different between 2 drugs in effect of reducing viral load.
# Null hypothesis is "There is no different in effect of reducing viral load between drug 1 and2"
hist(treatment$PCR_inicial,
main = "Histogram of viral load at starting point",
xlab = "Viral load",
col = "darkseagreen1")
hist(treatment$PCR_posterior,
main = "Histogram of viral load at end point",
xlab = "Viral load",
col = "darkseagreen2")
###The PCR_inicial, PCR_posterior are skewed to the right.
##We will use log10 scale for these variable to transom to normal distribution.
treatment<- treatment%>%
mutate("logPCRpre"=log10(PCR_inicial),
"logPCRpost"=log10(PCR_posterior))
hist(treatment$logPCRpre,
main = "Histogram of viral load at starting point on log scale",
xlab = "Viral load",
col = "darkseagreen1")
hist(treatment$logPCRpost,
main = "Histogram of viral load at end point on log scale",
xlab = "Viral load",
col = "darkseagreen2")
##Now both variables look approximately normal distributed.
##Compare by linear regression.
model7 <- lm(data = treatment,logPCRpost~as.factor(Medicine)+logPCRpre)
summary(model7)
##
## Call:
## lm(formula = logPCRpost ~ as.factor(Medicine) + logPCRpre, data = treatment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00999 -0.21470 -0.00627 0.21561 0.93749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36988 0.08567 4.318 2.68e-05 ***
## as.factor(Medicine)1 -0.32229 0.05673 -5.681 5.74e-08 ***
## logPCRpre 0.42990 0.07278 5.907 1.87e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3699 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.2811, Adjusted R-squared: 0.2726
## F-statistic: 33.04 on 2 and 169 DF, p-value: 7.737e-13
After adjust with same baseline (logPCRpre), the coefficient of end point viral load (logPCRpost) in Medicine1 is a statistical significant different with p value <0.01. We can conclude that the different in effect of reducing viral load between 2 drug is a statistical significant.
5.Using linear regression for predict result.
This purpose is straightforward, if we have a multiple regression model
and later we have a new set of predictors which in the model we can
calculate the out come from this new predictors.
bodyfat<-read_dta("bodyfat.dta")
head(bodyfat)
## # A tibble: 6 × 8
## bodyfat age weight height neck abdomen knee ankle
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12.3 23 154. 67.8 36.2 85.2 37.3 21.9
## 2 6.10 22 173. 72.2 38.5 83 37.3 23.4
## 3 25.3 22 154 66.2 34 87.9 38.9 24
## 4 10.4 26 185. 72.2 37.4 86.4 37.3 22.8
## 5 28.7 24 184. 71.2 34.4 100 42.2 24
## 6 20.9 24 210. 74.8 39 94.4 42 25.6
##If we know age, weight, height, circumference of neck, abdomen, knee and ankle, so how much of our bodyfat?
cor(bodyfat)
## bodyfat age weight height neck abdomen
## bodyfat 1.0000000 0.37945710 0.58076654 -0.1823959 0.45721926 0.8413881
## age 0.3794571 1.00000000 -0.05870725 -0.2578485 0.08026069 0.2565489
## weight 0.5807665 -0.05870725 1.00000000 0.4509462 0.80235058 0.8553529
## height -0.1823959 -0.25784854 0.45094623 1.0000000 0.28007141 0.0979518
## neck 0.4572193 0.08026069 0.80235058 0.2800714 1.00000000 0.7066356
## abdomen 0.8413881 0.25654887 0.85535287 0.0979518 0.70663561 1.0000000
## knee 0.5426897 -0.01325209 0.82634474 0.3781211 0.65393628 0.7262406
## ankle 0.2040759 -0.09580741 0.49241541 0.3488076 0.37208511 0.3080827
## knee ankle
## bodyfat 0.54268967 0.20407594
## age -0.01325209 -0.09580741
## weight 0.82634474 0.49241541
## height 0.37812112 0.34880762
## neck 0.65393628 0.37208511
## abdomen 0.72624064 0.30808273
## knee 1.00000000 0.45430725
## ankle 0.45430725 1.00000000
corrplot::corrplot.mixed(cor(bodyfat),
upper="square",
lower="number",
addgrid.col = "black",
tl.col ="black",
tl.pos = "lt",
lower.col = "Black")
model8<-lm(data = bodyfat,bodyfat~.)
summary(model8)
##
## Call:
## lm(formula = bodyfat ~ ., data = bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1016 -2.9148 -0.7392 3.3351 8.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.63842 23.98381 -0.235 0.81474
## age 0.04397 0.03812 1.154 0.25212
## weight -0.07294 0.06756 -1.080 0.28356
## height -0.69440 0.26036 -2.667 0.00926 **
## neck -0.56477 0.33051 -1.709 0.09137 .
## abdomen 0.86684 0.12923 6.708 2.57e-09 ***
## knee 0.47829 0.36361 1.315 0.19214
## ankle 0.31150 0.25267 1.233 0.22126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.188 on 80 degrees of freedom
## Multiple R-squared: 0.8097, Adjusted R-squared: 0.7931
## F-statistic: 48.64 on 7 and 80 DF, p-value: < 2.2e-16
Whole model p value is <0.01 and adjusted R square is 0.79, show this model is quite good. but only abdomen circumference and height are statistical significant predictors because of multi-collinearity.
##Backward elimination : Try remove insignificant variables just for easy purpose.
model9 <- lm(data = bodyfat, bodyfat~abdomen+height)
summary(model9)
##
## Call:
## lm(formula = bodyfat ~ abdomen + height, data = bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5383 -3.1899 -0.3873 3.1844 9.3872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.72632 13.09915 1.430 0.157
## abdomen 0.73287 0.04331 16.923 < 2e-16 ***
## height -0.96256 0.18456 -5.215 1.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.381 on 85 degrees of freedom
## Multiple R-squared: 0.7787, Adjusted R-squared: 0.7735
## F-statistic: 149.6 on 2 and 85 DF, p-value: < 2.2e-16
Adjusted R square of new model is 0.77 and p value of this model < 0.01, so new model is better than previous model because of simplicity.
Say we want to know how much of bodyfat if abdominal circumference is 100 and height is 70 we can calculate like this.
predict(model9,newdata = data.frame(abdomen=100,height=70))
## 1
## 24.63393
The answer is 24.63.