Linear Regression Models

Interpreting the linear regresion results

In this assignment I am going to use the “nassCDS” data set about car crashes in United States from 1997 - 2002. I will design linear models in order to conduct a set of regression analyses and interpret the results.

Loading packages

library(DAAG)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(texreg)
library(visreg)
library(sjmisc)

Preview of the “nassCDS” dataset

data(nassCDS)
head(nassCDS)
##   dvcat  weight  dead airbag seatbelt frontal sex ageOFocc yearacc yearVeh
## 1 25-39  25.069 alive   none   belted       1   f       26    1997    1990
## 2 10-24  25.069 alive airbag   belted       1   f       72    1997    1995
## 3 10-24  32.379 alive   none     none       1   f       69    1997    1988
## 4 25-39 495.444 alive airbag   belted       1   f       53    1997    1995
## 5 25-39  25.069 alive   none   belted       1   f       32    1997    1988
## 6 40-54  25.069 alive   none   belted       1   f       22    1997    1985
##     abcat occRole deploy injSeverity caseid
## 1 unavail  driver      0           3  2:3:1
## 2  deploy  driver      1           1  2:3:2
## 3 unavail  driver      0           4  2:5:1
## 4  deploy  driver      1           1 2:10:1
## 5 unavail  driver      0           3 2:11:1
## 6 unavail  driver      0           3 2:11:2

Changing the independant variables from characters to numeric. Creating new variables needed in this analysis

nassCDS <- nassCDS %>% 
  mutate(dead1 = as.integer(dead)) %>%
  mutate(seatbelt1 = as.integer(seatbelt)) %>%
  mutate (airbag1 = as.integer(airbag)) %>%
  mutate (sex1 = as.integer(sex))
nassCDS <- nassCDS %>% 
  select(dead, dead1, seatbelt, seatbelt1, airbag, airbag1,sex, sex1, everything())

s <- NA
nassCDS$speed <- s
nassCDS$speed[nassCDS$dvcat == "1-9km/h"] <- 1
nassCDS$speed[nassCDS$dvcat == "10-24"] <- 2
nassCDS$speed[nassCDS$dvcat == "25-39"] <- 3
nassCDS$speed[nassCDS$dvcat == "40-54"] <- 4
nassCDS$speed[nassCDS$dvcat == "55+"] <- 5
summary(nassCDS$speed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.685   3.000   5.000
nassCDS$carage <- (nassCDS$yearacc-(nassCDS$yearVeh-1))
summary(nassCDS$carage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   3.000   7.000   7.751  11.000  50.000       1

Preview of the data with the variables changed to numeric (dead - dead1, seatbelt - seatbelt1, airbag - airbag1, sex -sex1)

head(nassCDS)
##    dead dead1 seatbelt seatbelt1 airbag airbag1 sex sex1 dvcat  weight
## 1 alive     1   belted         2   none       1   f    1 25-39  25.069
## 2 alive     1   belted         2 airbag       2   f    1 10-24  25.069
## 3 alive     1     none         1   none       1   f    1 10-24  32.379
## 4 alive     1   belted         2 airbag       2   f    1 25-39 495.444
## 5 alive     1   belted         2   none       1   f    1 25-39  25.069
## 6 alive     1   belted         2   none       1   f    1 40-54  25.069
##   frontal ageOFocc yearacc yearVeh   abcat occRole deploy injSeverity
## 1       1       26    1997    1990 unavail  driver      0           3
## 2       1       72    1997    1995  deploy  driver      1           1
## 3       1       69    1997    1988 unavail  driver      0           4
## 4       1       53    1997    1995  deploy  driver      1           1
## 5       1       32    1997    1988 unavail  driver      0           3
## 6       1       22    1997    1985 unavail  driver      0           3
##   caseid speed carage
## 1  2:3:1     3      8
## 2  2:3:2     2      3
## 3  2:5:1     2     10
## 4 2:10:1     3      3
## 5 2:11:1     3     10
## 6 2:11:2     4     13

Recoding and creating new variables with the correct 0/1 assigned values

nassCDS <- nassCDS %>% 
  mutate(alive = sjmisc::rec(dead1, rec = "2=0; 1=1")) %>% 
  mutate (belted = sjmisc::rec(seatbelt1, rec = "2=1; 1=0")) %>% 
  mutate (airbagcar = sjmisc::rec(airbag1, rec = "1=0; 2=1")) %>% 
  mutate (gender = sjmisc::rec(sex1, rec = "1=0; 2=1")) %>%
select(dead, alive, seatbelt, belted, airbag, airbagcar, sex, gender, dvcat, speed, everything()) %>% 
select(-dead1, -seatbelt1, -airbag1, -sex1)
head(nassCDS)
##    dead alive seatbelt belted airbag airbagcar sex gender dvcat speed
## 1 alive     1   belted      1   none         0   f      0 25-39     3
## 2 alive     1   belted      1 airbag         1   f      0 10-24     2
## 3 alive     1     none      0   none         0   f      0 10-24     2
## 4 alive     1   belted      1 airbag         1   f      0 25-39     3
## 5 alive     1   belted      1   none         0   f      0 25-39     3
## 6 alive     1   belted      1   none         0   f      0 40-54     4
##    weight frontal ageOFocc yearacc yearVeh   abcat occRole deploy
## 1  25.069       1       26    1997    1990 unavail  driver      0
## 2  25.069       1       72    1997    1995  deploy  driver      1
## 3  32.379       1       69    1997    1988 unavail  driver      0
## 4 495.444       1       53    1997    1995  deploy  driver      1
## 5  25.069       1       32    1997    1988 unavail  driver      0
## 6  25.069       1       22    1997    1985 unavail  driver      0
##   injSeverity caseid carage
## 1           3  2:3:1      8
## 2           1  2:3:2      3
## 3           4  2:5:1     10
## 4           1 2:10:1      3
## 5           3 2:11:1     10
## 6           3 2:11:2     13

Running the first linear model with the interaction effect between two independant variables

lm.1 <- lm(carage ~ yearacc + belted * airbagcar, data = nassCDS)
summary(lm.1)
## 
## Call:
## lm(formula = carage ~ yearacc + belted * airbagcar, data = nassCDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.859  -2.317  -0.352   1.687  35.632 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.004e+03  2.634e+01  -38.12   <2e-16 ***
## yearacc           5.088e-01  1.318e-02   38.62   <2e-16 ***
## belted           -1.434e+00  6.779e-02  -21.16   <2e-16 ***
## airbagcar        -9.016e+00  8.282e-02 -108.86   <2e-16 ***
## belted:airbagcar  1.413e+00  9.801e-02   14.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.557 on 26211 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.557,  Adjusted R-squared:  0.5569 
## F-statistic:  8239 on 4 and 26211 DF,  p-value: < 2.2e-16

Interpretation of the first linear regression results

In this linear regression (lm.1), we looked at the effect of the year of the accident, seat belts and the airbag (independent variables) on the age of the vehicle(carage = dependent variable). The model includes the interaction between having the seat belts belted and having the airbag in the car and its effect on the prediction of the age of the car. We can notice that all the relationships are statistically significant by looking at the p-value.

The intercept starts at a value of -1004,24781. This number estimate the average value for Y when X = 0.

yearacc value equals to 0.50880. It means that for each year of accident the age of the car increases 0.50880. For example, lets say that the accident occurred in 2000. To estimate the age of the car based on this information we have to multiply the number 0.50880 with 2000 and then add the value of the intercept to this equation:

0.50880 * 2000 + (-1004.24781) = 13.35219

The car which had an accident in 2000, is estimated to be 13.35219 years old (without checking for other possible coefficients)

belted value equals to -1.43425.(1= belted, 0 = non belted). It means that if the seat belts were belted the age of the car decreases 1.43425. If the belts were not belted, we don’t take this number into consideration while estimating the age of the car.

airbagcar value equals to -9.01625. (1= airbag, 0 = no airbag). It means that if there was an airbag in a car, the age of the car decreases 9.01625. If there was no airbag in the car, we don’t take this number into consideration while estimating the age of the car.

belted:airbagcar value equals 1.41287. It means that if both the seat belts were belted and there was an airbag in a car, the age of the car in a given year of the accident increases 1.41287.

Dataset that shows the variables involved in the first interaction model

The data1 below shows the variables involved in the interaction which confirms the results of the linear regression. The data1 is grouped by the independent variables which are :the year of accident (yearacc), seatbelt and airbag and the mean value of dependent variable which is the mean age of the car (carage). The table shows the mean age of the car for each year of the accident. We can notice how the mean age of the car changes depending on other independent variables. We can also see that the mean age of the car without the seat belts and airbag is close to 13.35219 which proves the equation and correctness of the regression model number 1 (lm.1).

data1 <- nassCDS %>%
  select(carage, yearacc, seatbelt, airbag) %>%
  group_by(yearacc, seatbelt, airbag) %>%
  summarise(mean=mean(carage))
print (data1)
## # A tibble: 24 x 4
## # Groups:   yearacc, seatbelt [?]
##    yearacc seatbelt airbag      mean
##      <dbl>   <fctr> <fctr>     <dbl>
##  1    1997     none   none 11.404894
##  2    1997     none airbag  3.321429
##  3    1997   belted   none  9.778605
##  4    1997   belted airbag        NA
##  5    1998     none   none 12.091541
##  6    1998     none airbag  3.452489
##  7    1998   belted   none 10.731017
##  8    1998   belted airbag  3.485332
##  9    1999     none   none 13.059701
## 10    1999     none airbag  3.852459
## # ... with 14 more rows

Plotting the graphs and visuals that confirms the regression results

visreg(lm.1, "yearacc", scale = "response")
## Conditions used in construction of plot
## belted: 1
## airbagcar: 1

visreg(lm.1, "yearacc", by = "airbagcar", scale = "response")

Running the second linear model:

lm.2<- lm(ageOFocc ~ alive + factor(speed), data=nassCDS)
summary(lm.2)
## 
## Call:
## lm(formula = ageOFocc ~ alive + factor(speed), data = nassCDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.336 -14.708  -3.948  10.292  65.866 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     46.6795     0.8767  53.243  < 2e-16 ***
## alive          -10.3882     0.5582 -18.612  < 2e-16 ***
## factor(speed)2   1.6568     0.6960   2.380   0.0173 *  
## factor(speed)3   0.4170     0.7061   0.591   0.5548    
## factor(speed)4  -1.7765     0.7547  -2.354   0.0186 *  
## factor(speed)5  -5.1573     0.8334  -6.188 6.18e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.76 on 26211 degrees of freedom
## Multiple R-squared:  0.01672,    Adjusted R-squared:  0.01653 
## F-statistic: 89.12 on 5 and 26211 DF,  p-value: < 2.2e-16

Interpretation of the second linear regression results:

In this linear regression (lm.2) we are looking at the effects of speed of driving and being able to survive the car accident (independent variables) that affect the age of the occupant of the car (dependent variable). In other worlds, we want to predict the age of the occupant based on the speed of the car during the accident and ability to survive the crash.

The intercept starts at a value of 46.6795 This number estimate the average value for Y when X = 0.

alive value equals to -10.3882. (1= alive, 0 = dead) It means that if the person survived the car crash, we subtract 10.3882 from the “starting” age of the person given in an intercept. 46.6795 is an approximate age of a person who didn’t survive the car accident (without taking into consideration the speed of driving) 46.6795 - 10.3882 = 36,2913 is an approximate age of a person who survived the car accident (without taking into consideration the speed of driving)

speed The variable speed is coded so that: 1 indicates 1-9km/h, 2 = 10-24 km/h, 3 = 25-39 km/h, 4 = 40-54 km/h, 5 = 55+ km/h.

factor(speed)2 If the person was driving with a speed between 10-24 km/h, we add 1.6568 to the age of the occupant. (Older people tend to drive slower than younger people) To count the age of the person who survived the accident and was driving between 10-24 km/h, we would count it as below:

46.6795 - 10.3882 + 1.6568 = 37.9481 (average age of the person)

factor(speed)3 If the person was driving with a speed between 25-39 km/h, we add 0.4170 to the age of the occupant.

factor(speed)4 If the person was driving with a speed between 40-54 km/h, we subtract 1.7765 (add -1.7765) from the age of the occupant. (The age of the person decreases because younger people tend to drive faster than older people)

factor(speed)5 If the person was driving with a speed of 55+ km/h, we subtract 5.1573 from the age of the occupant.

Dataset that shows the variables involved in the second interaction model

The data2 below presents the variables involved in the interaction and confirms the validity of the regression results. We can notice that the average age of the people who survived the car accident 36 and the average age for the people who did not survive the accident is 47. The data2 also show that the average age of people who drives faster (speed 5) is smaller than the age of people who drives slower (speed 1).

data2 <- nassCDS %>%
  select(ageOFocc, dead, speed) %>%
  group_by(dead, speed) %>%
  summarise(mean=mean(ageOFocc))
print (data2)
## # A tibble: 10 x 3
## # Groups:   dead [?]
##      dead speed     mean
##    <fctr> <dbl>    <dbl>
##  1  alive     1 36.35871
##  2  alive     2 37.92359
##  3  alive     3 36.54475
##  4  alive     4 34.76795
##  5  alive     5 31.96286
##  6   dead     1 31.33333
##  7   dead     2 51.07018
##  8   dead     3 51.35197
##  9   dead     4 42.96512
## 10   dead     5 39.37108

Plotting the graphs and visuals that confirms the regression results

plot2 <- ggplot(data = nassCDS, aes (x = ageOFocc, y = speed)) + geom_col() + scale_y_continuous(breaks=c(seq(1,5)))
plot2

visreg(lm.2, "alive", scale = "response")

Running the third linear model:

lm.3 <- lm(carage ~ gender + alive, data = nassCDS)
summary(lm.3)
## 
## Call:
## lm(formula = carage ~ gender + alive, data = nassCDS)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.363 -4.182 -1.145  3.674 42.855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.3256     0.1597  52.126  < 2e-16 ***
## gender        1.0374     0.0658  15.766  < 2e-16 ***
## alive        -1.1805     0.1583  -7.456  9.2e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.312 on 26213 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01176,    Adjusted R-squared:  0.01169 
## F-statistic:   156 on 2 and 26213 DF,  p-value: < 2.2e-16

Interpretation of the third linear regression results:

The linear regression (lm.3) present the prediction of the age of the car (dependent var.) based on the gender of the occupant and being able to survive the accident (independent variables).

The intercept starts at a value of 8.3256. This number estimate the average value for Y when X = 0. In this example the intercept shows the average age of the car without checking for other factors.

gender value equals to 1.0374. (1 = male, 0 = female) It means that we have to add 1.0374 if an occupant of the car is a man. In average males tend to have cars older for about 1 year than females.

alive value equals to -1.1805. It means that if the person survived the accident we subtract 1.1805 from the age of the car. People who survived the accidents tend to have newer cars.

Dataset that shows the variables involved in the third interaction model:

The data3 shows two independent variables involved in the interaction and the summary of the dependent variable. The data confirms the results from the regression model. despite of one missing value, we can still notice that the ones who survived the accident tend to be 1 year younger and that females tend to have newer cars than males.

data3 <- nassCDS %>%
  select(carage, sex, dead) %>%
  group_by(sex, dead) %>%
  summarise(mean=mean(carage))
print (data3)
## # A tibble: 4 x 3
## # Groups:   sex [?]
##      sex   dead     mean
##   <fctr> <fctr>    <dbl>
## 1      f  alive 7.141548
## 2      f   dead 8.415948
## 3      m  alive       NA
## 4      m   dead 9.304469

Plotting the visual representations of a data and the linear model which confirms the regression results:

plot3 <- ggplot(data = nassCDS, aes (x = carage,y = alive)) + geom_smooth()
plot3

visreg(lm.3, "alive", by = "gender", scale = "response")

visreg(lm.3, "alive", scale = "response")

Running the fourth linear model with the interaction effect between two independant variables

lm.4 <- lm(speed ~ alive + gender * belted, data=nassCDS)
summary(lm.4)
## 
## Call:
## lm(formula = speed ~ alive + gender * belted, data = nassCDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0818 -0.5783 -0.5058  0.4942  2.4942 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.91554    0.02893 135.336  < 2e-16 ***
## alive         -1.16630    0.02597 -44.904  < 2e-16 ***
## gender         0.16622    0.02041   8.143 4.02e-16 ***
## belted        -0.24346    0.01848 -13.175  < 2e-16 ***
## gender:belted -0.09374    0.02403  -3.902 9.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8635 on 26212 degrees of freedom
## Multiple R-squared:  0.1091, Adjusted R-squared:  0.1089 
## F-statistic: 802.2 on 4 and 26212 DF,  p-value: < 2.2e-16

Interpretation of the fourth linear regression results

In the fourth linear regression (lm.4) we looked at the three independent variables: alive, gender, carage that estimates the speed of the car during the crash (dependent variable). The regression also includes the interaction between the gender and having the seat belts belted and its effect on the prediction of the speed of the car.

The intercept starts at a value of 3.91554. This number estimate the average value for Y when X = 0. In this regression, the number 3.9 stands for the speed of the car close to 40 km/h.

alive value equals to -1.16630. It means that if the person survived the accident we should subtract 1.16630 from 3.91554 to predict the speed of the car during the accident in which the person was alive.

gender value equals to 0.16622. It means that the speed of a car increases 0.16622 if the person was a male.

belted value equals to -0.24346. It means that the speed of the car decreases 0.24346 if the seat belts were belted.

gender:belted value equals to -0.09374. It means that the speed of the car decreases 0.09374 among the males who had the seat belts belted.

Dataset that shows the variables involved in the foutrh interaction model

The data4 shows the variables dead(alive), sex(gender), belted(seatbelts) and its affect on a speed of driving. It confirms the results from the linear regression model. The mean speed increases for people who did not survive the car crash, males tend to drive faster than females and the ones who doesn’t have their seat belt belted also tend to drive faster. All the independent variables in the regression model affect the dependent variable (speed).

data4 <- nassCDS %>%
  select(speed, dead, sex, seatbelt) %>%
  group_by(dead, sex, seatbelt) %>%
  summarise(mean=mean(speed))
print (data4)
## # A tibble: 8 x 4
## # Groups:   dead, sex [?]
##     dead    sex seatbelt     mean
##   <fctr> <fctr>   <fctr>    <dbl>
## 1  alive      f     none 2.745750
## 2  alive      f   belted 2.501478
## 3  alive      m     none 2.942089
## 4  alive      m   belted 2.570837
## 5   dead      f     none 3.958333
## 6   dead      f   belted 3.830645
## 7   dead      m     none 3.834052
## 8   dead      m   belted 4.007937

Plotting the visual representation of a data and the linear model which confirms the regression results

visreg(lm.4, "gender", by = "alive", scale = "response")

visreg(lm.4, "belted", by = "gender", scale = "response")