In this assignment I am going to use the “nassCDS” data set about car crashes in United States from 1997 - 2002. I will design linear models in order to conduct a set of regression analyses and interpret the results.
library(DAAG)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(texreg)
library(visreg)
library(sjmisc)
data(nassCDS)
head(nassCDS)
## dvcat weight dead airbag seatbelt frontal sex ageOFocc yearacc yearVeh
## 1 25-39 25.069 alive none belted 1 f 26 1997 1990
## 2 10-24 25.069 alive airbag belted 1 f 72 1997 1995
## 3 10-24 32.379 alive none none 1 f 69 1997 1988
## 4 25-39 495.444 alive airbag belted 1 f 53 1997 1995
## 5 25-39 25.069 alive none belted 1 f 32 1997 1988
## 6 40-54 25.069 alive none belted 1 f 22 1997 1985
## abcat occRole deploy injSeverity caseid
## 1 unavail driver 0 3 2:3:1
## 2 deploy driver 1 1 2:3:2
## 3 unavail driver 0 4 2:5:1
## 4 deploy driver 1 1 2:10:1
## 5 unavail driver 0 3 2:11:1
## 6 unavail driver 0 3 2:11:2
nassCDS <- nassCDS %>%
mutate(dead1 = as.integer(dead)) %>%
mutate(seatbelt1 = as.integer(seatbelt)) %>%
mutate (airbag1 = as.integer(airbag)) %>%
mutate (sex1 = as.integer(sex))
nassCDS <- nassCDS %>%
select(dead, dead1, seatbelt, seatbelt1, airbag, airbag1,sex, sex1, everything())
s <- NA
nassCDS$speed <- s
nassCDS$speed[nassCDS$dvcat == "1-9km/h"] <- 1
nassCDS$speed[nassCDS$dvcat == "10-24"] <- 2
nassCDS$speed[nassCDS$dvcat == "25-39"] <- 3
nassCDS$speed[nassCDS$dvcat == "40-54"] <- 4
nassCDS$speed[nassCDS$dvcat == "55+"] <- 5
summary(nassCDS$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.685 3.000 5.000
nassCDS$carage <- (nassCDS$yearacc-(nassCDS$yearVeh-1))
summary(nassCDS$carage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 7.000 7.751 11.000 50.000 1
head(nassCDS)
## dead dead1 seatbelt seatbelt1 airbag airbag1 sex sex1 dvcat weight
## 1 alive 1 belted 2 none 1 f 1 25-39 25.069
## 2 alive 1 belted 2 airbag 2 f 1 10-24 25.069
## 3 alive 1 none 1 none 1 f 1 10-24 32.379
## 4 alive 1 belted 2 airbag 2 f 1 25-39 495.444
## 5 alive 1 belted 2 none 1 f 1 25-39 25.069
## 6 alive 1 belted 2 none 1 f 1 40-54 25.069
## frontal ageOFocc yearacc yearVeh abcat occRole deploy injSeverity
## 1 1 26 1997 1990 unavail driver 0 3
## 2 1 72 1997 1995 deploy driver 1 1
## 3 1 69 1997 1988 unavail driver 0 4
## 4 1 53 1997 1995 deploy driver 1 1
## 5 1 32 1997 1988 unavail driver 0 3
## 6 1 22 1997 1985 unavail driver 0 3
## caseid speed carage
## 1 2:3:1 3 8
## 2 2:3:2 2 3
## 3 2:5:1 2 10
## 4 2:10:1 3 3
## 5 2:11:1 3 10
## 6 2:11:2 4 13
nassCDS <- nassCDS %>%
mutate(alive = sjmisc::rec(dead1, rec = "2=0; 1=1")) %>%
mutate (belted = sjmisc::rec(seatbelt1, rec = "2=1; 1=0")) %>%
mutate (airbagcar = sjmisc::rec(airbag1, rec = "1=0; 2=1")) %>%
mutate (gender = sjmisc::rec(sex1, rec = "1=0; 2=1")) %>%
select(dead, alive, seatbelt, belted, airbag, airbagcar, sex, gender, dvcat, speed, everything()) %>%
select(-dead1, -seatbelt1, -airbag1, -sex1)
head(nassCDS)
## dead alive seatbelt belted airbag airbagcar sex gender dvcat speed
## 1 alive 1 belted 1 none 0 f 0 25-39 3
## 2 alive 1 belted 1 airbag 1 f 0 10-24 2
## 3 alive 1 none 0 none 0 f 0 10-24 2
## 4 alive 1 belted 1 airbag 1 f 0 25-39 3
## 5 alive 1 belted 1 none 0 f 0 25-39 3
## 6 alive 1 belted 1 none 0 f 0 40-54 4
## weight frontal ageOFocc yearacc yearVeh abcat occRole deploy
## 1 25.069 1 26 1997 1990 unavail driver 0
## 2 25.069 1 72 1997 1995 deploy driver 1
## 3 32.379 1 69 1997 1988 unavail driver 0
## 4 495.444 1 53 1997 1995 deploy driver 1
## 5 25.069 1 32 1997 1988 unavail driver 0
## 6 25.069 1 22 1997 1985 unavail driver 0
## injSeverity caseid carage
## 1 3 2:3:1 8
## 2 1 2:3:2 3
## 3 4 2:5:1 10
## 4 1 2:10:1 3
## 5 3 2:11:1 10
## 6 3 2:11:2 13
lm.1 <- lm(carage ~ yearacc + belted * airbagcar, data = nassCDS)
summary(lm.1)
##
## Call:
## lm(formula = carage ~ yearacc + belted * airbagcar, data = nassCDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.859 -2.317 -0.352 1.687 35.632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004e+03 2.634e+01 -38.12 <2e-16 ***
## yearacc 5.088e-01 1.318e-02 38.62 <2e-16 ***
## belted -1.434e+00 6.779e-02 -21.16 <2e-16 ***
## airbagcar -9.016e+00 8.282e-02 -108.86 <2e-16 ***
## belted:airbagcar 1.413e+00 9.801e-02 14.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.557 on 26211 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.557, Adjusted R-squared: 0.5569
## F-statistic: 8239 on 4 and 26211 DF, p-value: < 2.2e-16
In this linear regression (lm.1), we looked at the effect of the year of the accident, seat belts and the airbag (independent variables) on the age of the vehicle(carage = dependent variable). The model includes the interaction between having the seat belts belted and having the airbag in the car and its effect on the prediction of the age of the car. We can notice that all the relationships are statistically significant by looking at the p-value.
The intercept starts at a value of -1004,24781. This number estimate the average value for Y when X = 0.
yearacc value equals to 0.50880. It means that for each year of accident the age of the car increases 0.50880. For example, lets say that the accident occurred in 2000. To estimate the age of the car based on this information we have to multiply the number 0.50880 with 2000 and then add the value of the intercept to this equation:
0.50880 * 2000 + (-1004.24781) = 13.35219
The car which had an accident in 2000, is estimated to be 13.35219 years old (without checking for other possible coefficients)
belted value equals to -1.43425.(1= belted, 0 = non belted). It means that if the seat belts were belted the age of the car decreases 1.43425. If the belts were not belted, we don’t take this number into consideration while estimating the age of the car.
airbagcar value equals to -9.01625. (1= airbag, 0 = no airbag). It means that if there was an airbag in a car, the age of the car decreases 9.01625. If there was no airbag in the car, we don’t take this number into consideration while estimating the age of the car.
belted:airbagcar value equals 1.41287. It means that if both the seat belts were belted and there was an airbag in a car, the age of the car in a given year of the accident increases 1.41287.
The data1 below shows the variables involved in the interaction which confirms the results of the linear regression. The data1 is grouped by the independent variables which are :the year of accident (yearacc), seatbelt and airbag and the mean value of dependent variable which is the mean age of the car (carage). The table shows the mean age of the car for each year of the accident. We can notice how the mean age of the car changes depending on other independent variables. We can also see that the mean age of the car without the seat belts and airbag is close to 13.35219 which proves the equation and correctness of the regression model number 1 (lm.1).
data1 <- nassCDS %>%
select(carage, yearacc, seatbelt, airbag) %>%
group_by(yearacc, seatbelt, airbag) %>%
summarise(mean=mean(carage))
print (data1)
## # A tibble: 24 x 4
## # Groups: yearacc, seatbelt [?]
## yearacc seatbelt airbag mean
## <dbl> <fctr> <fctr> <dbl>
## 1 1997 none none 11.404894
## 2 1997 none airbag 3.321429
## 3 1997 belted none 9.778605
## 4 1997 belted airbag NA
## 5 1998 none none 12.091541
## 6 1998 none airbag 3.452489
## 7 1998 belted none 10.731017
## 8 1998 belted airbag 3.485332
## 9 1999 none none 13.059701
## 10 1999 none airbag 3.852459
## # ... with 14 more rows
visreg(lm.1, "yearacc", scale = "response")
## Conditions used in construction of plot
## belted: 1
## airbagcar: 1
visreg(lm.1, "yearacc", by = "airbagcar", scale = "response")
lm.2<- lm(ageOFocc ~ alive + factor(speed), data=nassCDS)
summary(lm.2)
##
## Call:
## lm(formula = ageOFocc ~ alive + factor(speed), data = nassCDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.336 -14.708 -3.948 10.292 65.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.6795 0.8767 53.243 < 2e-16 ***
## alive -10.3882 0.5582 -18.612 < 2e-16 ***
## factor(speed)2 1.6568 0.6960 2.380 0.0173 *
## factor(speed)3 0.4170 0.7061 0.591 0.5548
## factor(speed)4 -1.7765 0.7547 -2.354 0.0186 *
## factor(speed)5 -5.1573 0.8334 -6.188 6.18e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.76 on 26211 degrees of freedom
## Multiple R-squared: 0.01672, Adjusted R-squared: 0.01653
## F-statistic: 89.12 on 5 and 26211 DF, p-value: < 2.2e-16
In this linear regression (lm.2) we are looking at the effects of speed of driving and being able to survive the car accident (independent variables) that affect the age of the occupant of the car (dependent variable). In other worlds, we want to predict the age of the occupant based on the speed of the car during the accident and ability to survive the crash.
The intercept starts at a value of 46.6795 This number estimate the average value for Y when X = 0.
alive value equals to -10.3882. (1= alive, 0 = dead) It means that if the person survived the car crash, we subtract 10.3882 from the “starting” age of the person given in an intercept. 46.6795 is an approximate age of a person who didn’t survive the car accident (without taking into consideration the speed of driving) 46.6795 - 10.3882 = 36,2913 is an approximate age of a person who survived the car accident (without taking into consideration the speed of driving)
speed The variable speed is coded so that: 1 indicates 1-9km/h, 2 = 10-24 km/h, 3 = 25-39 km/h, 4 = 40-54 km/h, 5 = 55+ km/h.
factor(speed)2 If the person was driving with a speed between 10-24 km/h, we add 1.6568 to the age of the occupant. (Older people tend to drive slower than younger people) To count the age of the person who survived the accident and was driving between 10-24 km/h, we would count it as below:
46.6795 - 10.3882 + 1.6568 = 37.9481 (average age of the person)
factor(speed)3 If the person was driving with a speed between 25-39 km/h, we add 0.4170 to the age of the occupant.
factor(speed)4 If the person was driving with a speed between 40-54 km/h, we subtract 1.7765 (add -1.7765) from the age of the occupant. (The age of the person decreases because younger people tend to drive faster than older people)
factor(speed)5 If the person was driving with a speed of 55+ km/h, we subtract 5.1573 from the age of the occupant.
The data2 below presents the variables involved in the interaction and confirms the validity of the regression results. We can notice that the average age of the people who survived the car accident 36 and the average age for the people who did not survive the accident is 47. The data2 also show that the average age of people who drives faster (speed 5) is smaller than the age of people who drives slower (speed 1).
data2 <- nassCDS %>%
select(ageOFocc, dead, speed) %>%
group_by(dead, speed) %>%
summarise(mean=mean(ageOFocc))
print (data2)
## # A tibble: 10 x 3
## # Groups: dead [?]
## dead speed mean
## <fctr> <dbl> <dbl>
## 1 alive 1 36.35871
## 2 alive 2 37.92359
## 3 alive 3 36.54475
## 4 alive 4 34.76795
## 5 alive 5 31.96286
## 6 dead 1 31.33333
## 7 dead 2 51.07018
## 8 dead 3 51.35197
## 9 dead 4 42.96512
## 10 dead 5 39.37108
plot2 <- ggplot(data = nassCDS, aes (x = ageOFocc, y = speed)) + geom_col() + scale_y_continuous(breaks=c(seq(1,5)))
plot2
visreg(lm.2, "alive", scale = "response")
lm.3 <- lm(carage ~ gender + alive, data = nassCDS)
summary(lm.3)
##
## Call:
## lm(formula = carage ~ gender + alive, data = nassCDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.363 -4.182 -1.145 3.674 42.855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3256 0.1597 52.126 < 2e-16 ***
## gender 1.0374 0.0658 15.766 < 2e-16 ***
## alive -1.1805 0.1583 -7.456 9.2e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.312 on 26213 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.01176, Adjusted R-squared: 0.01169
## F-statistic: 156 on 2 and 26213 DF, p-value: < 2.2e-16
The linear regression (lm.3) present the prediction of the age of the car (dependent var.) based on the gender of the occupant and being able to survive the accident (independent variables).
The intercept starts at a value of 8.3256. This number estimate the average value for Y when X = 0. In this example the intercept shows the average age of the car without checking for other factors.
gender value equals to 1.0374. (1 = male, 0 = female) It means that we have to add 1.0374 if an occupant of the car is a man. In average males tend to have cars older for about 1 year than females.
alive value equals to -1.1805. It means that if the person survived the accident we subtract 1.1805 from the age of the car. People who survived the accidents tend to have newer cars.
The data3 shows two independent variables involved in the interaction and the summary of the dependent variable. The data confirms the results from the regression model. despite of one missing value, we can still notice that the ones who survived the accident tend to be 1 year younger and that females tend to have newer cars than males.
data3 <- nassCDS %>%
select(carage, sex, dead) %>%
group_by(sex, dead) %>%
summarise(mean=mean(carage))
print (data3)
## # A tibble: 4 x 3
## # Groups: sex [?]
## sex dead mean
## <fctr> <fctr> <dbl>
## 1 f alive 7.141548
## 2 f dead 8.415948
## 3 m alive NA
## 4 m dead 9.304469
plot3 <- ggplot(data = nassCDS, aes (x = carage,y = alive)) + geom_smooth()
plot3
visreg(lm.3, "alive", by = "gender", scale = "response")
visreg(lm.3, "alive", scale = "response")
lm.4 <- lm(speed ~ alive + gender * belted, data=nassCDS)
summary(lm.4)
##
## Call:
## lm(formula = speed ~ alive + gender * belted, data = nassCDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0818 -0.5783 -0.5058 0.4942 2.4942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.91554 0.02893 135.336 < 2e-16 ***
## alive -1.16630 0.02597 -44.904 < 2e-16 ***
## gender 0.16622 0.02041 8.143 4.02e-16 ***
## belted -0.24346 0.01848 -13.175 < 2e-16 ***
## gender:belted -0.09374 0.02403 -3.902 9.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8635 on 26212 degrees of freedom
## Multiple R-squared: 0.1091, Adjusted R-squared: 0.1089
## F-statistic: 802.2 on 4 and 26212 DF, p-value: < 2.2e-16
In the fourth linear regression (lm.4) we looked at the three independent variables: alive, gender, carage that estimates the speed of the car during the crash (dependent variable). The regression also includes the interaction between the gender and having the seat belts belted and its effect on the prediction of the speed of the car.
The intercept starts at a value of 3.91554. This number estimate the average value for Y when X = 0. In this regression, the number 3.9 stands for the speed of the car close to 40 km/h.
alive value equals to -1.16630. It means that if the person survived the accident we should subtract 1.16630 from 3.91554 to predict the speed of the car during the accident in which the person was alive.
gender value equals to 0.16622. It means that the speed of a car increases 0.16622 if the person was a male.
belted value equals to -0.24346. It means that the speed of the car decreases 0.24346 if the seat belts were belted.
gender:belted value equals to -0.09374. It means that the speed of the car decreases 0.09374 among the males who had the seat belts belted.
The data4 shows the variables dead(alive), sex(gender), belted(seatbelts) and its affect on a speed of driving. It confirms the results from the linear regression model. The mean speed increases for people who did not survive the car crash, males tend to drive faster than females and the ones who doesn’t have their seat belt belted also tend to drive faster. All the independent variables in the regression model affect the dependent variable (speed).
data4 <- nassCDS %>%
select(speed, dead, sex, seatbelt) %>%
group_by(dead, sex, seatbelt) %>%
summarise(mean=mean(speed))
print (data4)
## # A tibble: 8 x 4
## # Groups: dead, sex [?]
## dead sex seatbelt mean
## <fctr> <fctr> <fctr> <dbl>
## 1 alive f none 2.745750
## 2 alive f belted 2.501478
## 3 alive m none 2.942089
## 4 alive m belted 2.570837
## 5 dead f none 3.958333
## 6 dead f belted 3.830645
## 7 dead m none 3.834052
## 8 dead m belted 4.007937
visreg(lm.4, "gender", by = "alive", scale = "response")
visreg(lm.4, "belted", by = "gender", scale = "response")