Basis linear regression usage

This article is a part of FETP biostatistic course. So I can not provide the dataset used in this article.

#Setting up environment with useful packages
library(tidyverse)
library(haven)

            We can use regression model for many purposes like predict outcome from given predictors, explore relationship of many given factors, test some hypothesis, adjust variable for data analysis.This article will provide some examples for do these job with R.    
            Key assumptions when fitting a linear regression model are
            1.Linearity between outcome and predictors  
            2.The outcome variable is normally distributed across predictor values  
            3.The variance of outcomes should be the same across predictor values

First let try simple linear regression model fit.

Aceclofenac<-read_dta("Pharmacokinetics_Aceclofenac_100.dta")

model1<-lm(peakarea~drugconc,data = Aceclofenac)
summary(model1)

## 
## Call:
## lm(formula = peakarea ~ drugconc, data = Aceclofenac)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1637.16   -95.89   157.11   309.11  1186.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2543.04     280.46   9.067 1.02e-06 ***
## drugconc     7601.54      15.17 500.952  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 826.6 on 12 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  0.9999 
## F-statistic: 2.51e+05 on 1 and 12 DF,  p-value: < 2.2e-16

##Check regression model assumptions.
par(mfrow=c(2,2))
plot(model1)

par(mfrow=c(1,1))
hist(model1$residuals,
     main = "Histogram of model1 residuals",
     xlab = "Residuals",
     col = "pink")

As plots show, this model residuals are not normal distribution. So This is not a good regression model because it not met the linear regression assumptions.
In best practice we should check the the assumptions number 1 and 2 before fitting a linear regression model and check assumption number 3 after fitted model by examine model residuals distribution.

plot(Aceclofenac$peakarea~Aceclofenac$drugconc,
     main = "Scatter plot of predicter and outcome",
     xlab = "Drug concentration",
     ylab = "Peak area")

hist(Aceclofenac$peakarea,
     main = "Histogram of outcome",
     xlab = "Peak area",
     col = "cornflowerblue")

As above plots show, for model1, it violate assumption number 2, the outcome is not normally distributed. Now let try another linear regression and talk about model usage.
1.Comparison

lbw <- read_dta("lowbwt_dat.dta")
head(lbw)

## # A tibble: 6 × 13
##      ID         low   age   lwt    race   smoke   ptl     ht      ui   ftv   bwt
##   <dbl>   <dbl+lbl> <dbl> <dbl> <dbl+l> <dbl+l> <dbl> <dbl+> <dbl+l> <dbl> <dbl>
## 1    85 0 [birth w…    19   182 2 [bla… 0 [no]      0 0 [no] 1 [yes]     0  2523
## 2    86 0 [birth w…    33   155 3 [oth… 0 [no]      0 0 [no] 0 [no]      3  2551
## 3    87 0 [birth w…    20   105 1 [whi… 1 [yes]     0 0 [no] 0 [no]      1  2557
## 4    88 0 [birth w…    21   108 1 [whi… 1 [yes]     0 0 [no] 1 [yes]     2  2594
## 5    89 0 [birth w…    18   107 1 [whi… 1 [yes]     0 0 [no] 1 [yes]     0  2600
## 6    91 0 [birth w…    21   124 3 [oth… 0 [no]      0 0 [no] 0 [no]      0  2622
## # … with 2 more variables: `_Irace_2` <dbl>, `_Irace_3` <dbl>

##Say that we want to examine the difference between baby weight (bwt) among 2 groups of baby's mother who are smoker and not smoker.  
#We can use t-test for comparison.
t.test(lbw$bwt~lbw$smoke,var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  lbw$bwt by lbw$smoke
## t = 2.6336, df = 187, p-value = 0.009156
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   70.69274 492.73382
## sample estimates:
## mean in group 0 mean in group 1 
##        3054.957        2773.243

#set var.equal = TRUE for equal variances comparison.

According to p value 0.00916 mean that the difference of baby weight whose
mother is a smoker and not a smoker is not statistical significant.
We can also use linear regression model for comparison between groups too.

model2<- lm(data = lbw,bwt~smoke)
summary(model2)

## 
## Call:
## lm(formula = bwt ~ smoke, data = lbw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2064.24  -477.24    35.04   545.04  1935.04 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3054.96      66.93  45.642  < 2e-16 ***
## smoke        -281.71     106.97  -2.634  0.00916 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 717.8 on 187 degrees of freedom
## Multiple R-squared:  0.03576,    Adjusted R-squared:  0.03061 
## F-statistic: 6.936 on 1 and 187 DF,  p-value: 0.009156

The model2 also show p-value of smoke coefficients 0.00916 same as t-test.

2.Use linear regression for adjust variables.

model3<- lm(bwt~smoke+lwt,data = lbw)
summary(model3)

## 
## Call:
## lm(formula = bwt ~ smoke + lwt, data = lbw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2030.16  -447.00    27.74   514.20  1968.51 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2500.174    230.833  10.831   <2e-16 ***
## smoke       -270.013    105.590  -2.557   0.0114 *  
## lwt            4.238      1.690   2.508   0.0130 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 707.8 on 186 degrees of freedom
## Multiple R-squared:  0.06731,    Adjusted R-squared:  0.05728 
## F-statistic: 6.711 on 2 and 186 DF,  p-value: 0.001533

We can use multivariable linear regression model for adjusting variables of interest. We can fit model more than one variable by use + in lm function. Model3 is a linear regression model with 2 variable mother’s weight (lwt) and mother’s smoking status (smoke), look at coefficients we can interpret that
1.If mother smoke the baby weight will decrease by 270 grams compare to mother who not smoke with the same age. 2.If mother’s weight increase 1 gram,the baby weight will increase 4.238 grams adjusted to smoke status.

model4 <-lm(bwt~age+as.factor(race), data = lbw)
summary(model4)

## 
## Call:
## lm(formula = bwt ~ age + as.factor(race), data = lbw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2129.72  -489.38     7.96   523.20  1758.19 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2953.505    255.216  11.573   <2e-16 ***
## age                 6.185     10.067   0.614   0.5398    
## as.factor(race)2 -367.020    160.550  -2.286   0.0234 *  
## as.factor(race)3 -287.952    115.470  -2.494   0.0135 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 715.3 on 185 degrees of freedom
## Multiple R-squared:  0.05268,    Adjusted R-squared:  0.03732 
## F-statistic: 3.429 on 3 and 185 DF,  p-value: 0.01824

###race is numeric,which is a misinterpretation.
##Race should be transformed to factor before fitting into a model.
#White =1, Black=2, Asian=3
#So White is reference group.

We can interpret model 4 that after adjusted age effect 1.The baby who born with black race mother will has 367 grams lower weight compare to white race mother. 2.The baby who born with asian mother will has 287 grams lower weight compare to white race mother. 3.After adjusted race effect, the one year increase in mother age will increase baby weight by 6.185 grams.

3.Using linear regression for exploratory data analysis.
In this purpose we can use linear regression model to examine which predictors have positive or negative relation with outcome, which predictors are significant predictors.

vietnam_kap <- read_dta("Vietnam_KAP_ Covid-19.dta")
head(vietnam_kap)

## # A tibble: 6 × 60
##      No Code  A1Gender   Sex A2Age A3Educationlevel    Education Region   A4Area
##   <dbl> <chr> <chr>    <dbl> <dbl> <chr>                   <dbl> <chr>    <chr> 
## 1     1 VP01  Female       0    34 University & higher         2 Northern Vinhp…
## 2     2 VP02  Female       0    34 University & higher         2 Northern Vinhp…
## 3     3 VP03  Female       0    37 University & higher         2 Northern Vinhp…
## 4     4 VP04  Female       0    40 College                     0 Northern Vinhp…
## 5     5 VP05  Female       0    34 University & higher         2 Northern Vinhp…
## 6     6 VP06  Female       0    37 University & higher         2 Northern Vinhp…
## # … with 51 more variables: Province <dbl+lbl>,
## #   A5Workingexperienceindrugst <dbl>, B1ThevirusthatcausesCOVID <chr>,
## #   B2Peoplecanbecomeinfectedb <chr>, B3Directcontactwiththebloo <chr>,
## #   B4CommonsymptomsofCOVID19 <chr>, B5CommonsymptomsofCOVID19 <chr>,
## #   B6CommonsymptomsofCOVID19 <chr>, B7Difficultybreathingandsho <chr>,
## #   B8Sorethroatcanbeasymptom <chr>, B9Nauseaandvomittingcanbe <chr>,
## #   B10Headachecanbeasymptomo <chr>, B11Bellyacheanddiarrheacan <chr>, …

##What variables relate to knowledge mark in this dataset.
#We can use multiple regression to answer this

model5<- lm(data = vietnam_kap, Knowledgemark~ A2Age+
                  as.factor(Education)+
                  as.factor(Province))
summary(model5)

## 
## Call:
## lm(formula = Knowledgemark ~ A2Age + as.factor(Education) + as.factor(Province), 
##     data = vietnam_kap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3410 -0.5762  0.0696  0.7676  3.8122 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           10.846476   0.309463  35.049  < 2e-16 ***
## A2Age                  0.018274   0.005213   3.506 0.000476 ***
## as.factor(Education)1 -0.940544   0.119978  -7.839 1.14e-14 ***
## as.factor(Education)2  0.019517   0.113356   0.172 0.863336    
## as.factor(Province)2   0.429371   0.284564   1.509 0.131644    
## as.factor(Province)3   0.865883   0.275662   3.141 0.001732 ** 
## as.factor(Province)4   1.052352   0.274071   3.840 0.000131 ***
## as.factor(Province)5   0.514330   0.317508   1.620 0.105567    
## as.factor(Province)6   1.007949   0.370463   2.721 0.006625 ** 
## as.factor(Province)7   2.035274   0.268416   7.583 7.66e-14 ***
## as.factor(Province)8  -0.090228   0.265923  -0.339 0.734452    
## as.factor(Province)9   0.526401   0.266351   1.976 0.048387 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.4 on 1011 degrees of freedom
## Multiple R-squared:  0.2763, Adjusted R-squared:  0.2685 
## F-statistic:  35.1 on 11 and 1011 DF,  p-value: < 2.2e-16

Interpretation : According to p value the factors that statistical significant effect the knowledgemark are age, educated from middle school(Education2),live in province 3, 4, 6, 7 and 9. Educated from middle school and live in province 8 have a negative relationship with knowledgemark.

4.Using linear regression for hypothesis testing.
If our predictor of interest is a category data its slope significant level can tell us about the significant diference between group, which we can use to test hypothesis. Say if we want to know is there a different in total score for each teaching strategy. We can set null hypothesis to “There is no difference in total score for each teaching strategy group” and then fit linear model and look at teaching strategy coefficient.

blend_learn <- read_dta("blended_learning_pharmacy_education.dta")

head(blend_learn,10)

## # A tibble: 10 × 26
##    College learningstrategy       group   age gender   sex   pre  post casestudy
##      <dbl> <chr>              <dbl+lbl> <dbl> <chr>  <dbl> <dbl> <dbl>     <dbl>
##  1       9 bl               1 [experme…    22 f          0    36    43        30
##  2       9 bl               1 [experme…    22 f          0    30    46        30
##  3       9 bl               1 [experme…    22 f          0    25    50        34
##  4       9 bl               1 [experme…    22 m          1    32    38        14
##  5       9 bl               1 [experme…    22 m          1    24    49        34
##  6       9 bl               1 [experme…    22 f          0    42    49        34
##  7       9 bl               1 [experme…    23 f          0    25    43        32
##  8       9 bl               1 [experme…    22 f          0    20    50        34
##  9       9 bl               1 [experme…    23 f          0    36    50        34
## 10       9 bl               1 [experme…    23 f          0    27    50        24
## # … with 17 more variables: postpre <dbl>, total <dbl>, Intrin <dbl>,
## #   EXTRINSIC <dbl>, TASK <dbl>, CONTROL <dbl>, SELFEFFFI <dbl>, ANXIETY <dbl>,
## #   REHERSA <dbl>, ELABORA <dbl>, ORGANIZ <dbl>, CRITICAL <dbl>, METACO <dbl>,
## #   TIME <dbl>, EFFORT <dbl>, PEER <dbl>, HELP <dbl>

model6<- lm(data = blend_learn, total~as.factor(group))
summary(model6)

## 
## Call:
## lm(formula = total ~ as.factor(group), data = blend_learn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.113  -8.261   2.345  11.345  31.887 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         64.261      1.741  36.908  < 2e-16 ***
## as.factor(group)2  -12.148      2.744  -4.427 1.46e-05 ***
## as.factor(group)3   -7.606      2.497  -3.045  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.7 on 238 degrees of freedom
## Multiple R-squared:  0.08132,    Adjusted R-squared:  0.0736 
## F-statistic: 10.53 on 2 and 238 DF,  p-value: 4.134e-05

The teaching group coefficients show p value less that 0.05 we can reject null hypothesis and conclude that there is statistical significant different in total score for each teaching stategy group.And we can do post hoc analysis if we like.
Another example.

treatment <- read_dta("pneumonia_treatment.dta")
head(treatment)

## # A tibble: 6 × 39
##      ID age     Sexo hospitalstay Stay_from_Start… Outcome Medicine Comorbilidad
##   <dbl> <chr>  <dbl>        <dbl>            <dbl> <dbl+l> <dbl+lb>        <dbl>
## 1     1 (80,9…     0            4                3 1 [ali… 1 [Meth…            1
## 2     2 (60,7…     1            4                2 1 [ali… 1 [Meth…            1
## 3     3 (80,9…     1            4                2 1 [ali… 1 [Meth…            1
## 4     4 (80,9…     1            7                4 1 [ali… 1 [Meth…            1
## 5     5 (90,+]     1            4                3 1 [ali… 1 [Meth…            1
## 6     6 (60,7…     1            6                5 1 [ali… 1 [Meth…            1
## # … with 31 more variables: Cancer <chr>, Hypothyroidism <chr>,
## #   Heart_enf <chr>, Falla_Renal <chr>, Demencia <chr>, Obesity <chr>,
## #   admisison_date <chr>, follow_30_d <chr>, status_inicial <chr>, Epoc <chr>,
## #   Diabetes_mellitus <chr>, HTA <chr>, other_antibiotic <chr>,
## #   azithromycin <chr>, Received_antibiotic <chr>, MTP_start_date <chr>,
## #   received_colchicine <dbl>, pafi_inicial <dbl>, DHL_inicial <dbl>,
## #   Ferritina_inicial <dbl>, Dimero_D_inicial <dbl>, PCR_inicial <dbl>, …

unique(treatment$Medicine)

## <labelled<double>[2]>: Medicine
## [1] 1 0
## 
## Labels:
##  value                        label
##      0                Dexamethasone
##      1 Methylprednisolone high dose

## Suppose we want to know is there a different between 2 drugs in effect of reducing viral load.
# Null hypothesis is "There is no different in effect of reducing viral load between drug 1 and2"
hist(treatment$PCR_inicial,
     main = "Histogram of viral load at starting point",
     xlab = "Viral load",
     col = "darkseagreen1")

hist(treatment$PCR_posterior,
     main = "Histogram of viral load at end point",
     xlab = "Viral load",
     col = "darkseagreen2")

###The PCR_inicial, PCR_posterior are skewed to the right.
##We will use log10 scale for these variable to transom to normal distribution.
treatment<- treatment%>%
                  mutate("logPCRpre"=log10(PCR_inicial),
                         "logPCRpost"=log10(PCR_posterior))
hist(treatment$logPCRpre,
     main = "Histogram of viral load at starting point on log scale",
     xlab = "Viral load",
     col = "darkseagreen1")

hist(treatment$logPCRpost,
     main = "Histogram of viral load at end point on log scale",
     xlab = "Viral load",
     col = "darkseagreen2")

##Now both variables look approximately normal distributed.

##Compare by linear regression.
model7 <- lm(data = treatment,logPCRpost~as.factor(Medicine)+logPCRpre)
summary(model7)

## 
## Call:
## lm(formula = logPCRpost ~ as.factor(Medicine) + logPCRpre, data = treatment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00999 -0.21470 -0.00627  0.21561  0.93749 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.36988    0.08567   4.318 2.68e-05 ***
## as.factor(Medicine)1 -0.32229    0.05673  -5.681 5.74e-08 ***
## logPCRpre             0.42990    0.07278   5.907 1.87e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3699 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.2811, Adjusted R-squared:  0.2726 
## F-statistic: 33.04 on 2 and 169 DF,  p-value: 7.737e-13

After adjust with same baseline (logPCRpre), the coefficient of end point viral load (logPCRpost) in Medicine1 is a statistical significant different with p value <0.01. We can conclude that the different in effect of reducing viral load between 2 drug is a statistical significant.

5.Using linear regression for predict result.
This purpose is straightforward, if we have a multiple regression model and later we have a new set of predictors which in the model we can calculate the out come from this new predictors.

bodyfat<-read_dta("bodyfat.dta")
head(bodyfat)

## # A tibble: 6 × 8
##   bodyfat   age weight height  neck abdomen  knee ankle
##     <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1   12.3     23   154.   67.8  36.2    85.2  37.3  21.9
## 2    6.10    22   173.   72.2  38.5    83    37.3  23.4
## 3   25.3     22   154    66.2  34      87.9  38.9  24  
## 4   10.4     26   185.   72.2  37.4    86.4  37.3  22.8
## 5   28.7     24   184.   71.2  34.4   100    42.2  24  
## 6   20.9     24   210.   74.8  39      94.4  42    25.6

##If we know age, weight, height, circumference of neck, abdomen, knee and ankle, so how much of our bodyfat?
cor(bodyfat)

##            bodyfat         age      weight     height       neck   abdomen
## bodyfat  1.0000000  0.37945710  0.58076654 -0.1823959 0.45721926 0.8413881
## age      0.3794571  1.00000000 -0.05870725 -0.2578485 0.08026069 0.2565489
## weight   0.5807665 -0.05870725  1.00000000  0.4509462 0.80235058 0.8553529
## height  -0.1823959 -0.25784854  0.45094623  1.0000000 0.28007141 0.0979518
## neck     0.4572193  0.08026069  0.80235058  0.2800714 1.00000000 0.7066356
## abdomen  0.8413881  0.25654887  0.85535287  0.0979518 0.70663561 1.0000000
## knee     0.5426897 -0.01325209  0.82634474  0.3781211 0.65393628 0.7262406
## ankle    0.2040759 -0.09580741  0.49241541  0.3488076 0.37208511 0.3080827
##                knee       ankle
## bodyfat  0.54268967  0.20407594
## age     -0.01325209 -0.09580741
## weight   0.82634474  0.49241541
## height   0.37812112  0.34880762
## neck     0.65393628  0.37208511
## abdomen  0.72624064  0.30808273
## knee     1.00000000  0.45430725
## ankle    0.45430725  1.00000000

corrplot::corrplot.mixed(cor(bodyfat),
                   upper="square",
                   lower="number",
                   addgrid.col = "black",
                   tl.col ="black",
                   tl.pos = "lt",
                   lower.col = "Black")

model8<-lm(data = bodyfat,bodyfat~.)
summary(model8)

## 
## Call:
## lm(formula = bodyfat ~ ., data = bodyfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1016 -2.9148 -0.7392  3.3351  8.4197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.63842   23.98381  -0.235  0.81474    
## age          0.04397    0.03812   1.154  0.25212    
## weight      -0.07294    0.06756  -1.080  0.28356    
## height      -0.69440    0.26036  -2.667  0.00926 ** 
## neck        -0.56477    0.33051  -1.709  0.09137 .  
## abdomen      0.86684    0.12923   6.708 2.57e-09 ***
## knee         0.47829    0.36361   1.315  0.19214    
## ankle        0.31150    0.25267   1.233  0.22126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.188 on 80 degrees of freedom
## Multiple R-squared:  0.8097, Adjusted R-squared:  0.7931 
## F-statistic: 48.64 on 7 and 80 DF,  p-value: < 2.2e-16

Whole model p value is <0.01 and adjusted R square is 0.79, show this model is quite good. but only abdomen circumference and height are statistical significant predictors because of multi-collinearity.

##Backward elimination : Try remove insignificant variables just for easy purpose.
model9 <- lm(data = bodyfat, bodyfat~abdomen+height)
summary(model9)

## 
## Call:
## lm(formula = bodyfat ~ abdomen + height, data = bodyfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5383  -3.1899  -0.3873   3.1844   9.3872 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.72632   13.09915   1.430    0.157    
## abdomen      0.73287    0.04331  16.923  < 2e-16 ***
## height      -0.96256    0.18456  -5.215 1.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.381 on 85 degrees of freedom
## Multiple R-squared:  0.7787, Adjusted R-squared:  0.7735 
## F-statistic: 149.6 on 2 and 85 DF,  p-value: < 2.2e-16

Adjusted R square of new model is 0.77 and p value of this model < 0.01, so new model is better than previous model because of simplicity.

Say we want to know how much of bodyfat if abdominal circumference is 100 and height is 70 we can calculate like this.

predict(model9,newdata = data.frame(abdomen=100,height=70))

##        1 
## 24.63393

The answer is 24.63.

Basis linear regression usage

Jirapanakorn Sutham

2022-06-28