ACCT 426 / BUDA 451
Your name: Clayton Hammond
Before submitting, publish your results to RPubs. RPubs URL:
https://rpubs.com/ClaycanTCode/1258170
Overview
Use the Seatbelts dataset included with R.
Seatbelts is a time series giving the monthly totals of car drivers
in Great Britain killed or seriously injured from Jan 1969 to Dec 1984.
Compulsory wearing of seat belts was introduced on 31 Jan 1983.
See a description of the dataset below:
- DriversKilled: car drivers killed.
- front: front-seat passengers killed or seriously injured.
- rear: rear-seat passengers killed or seriously injured.
- kms: distance driven.
- PetrolPrice: petrol price.
- VanKilled: number of van (‘light goods vehicle’) drivers killed.
- law: was the law in effect that month? 1 or 0
# DO NOT EDIT THIS CODE
library(tidyverse)
options(scipen = 10)
data('Seatbelts')
t <- as_tibble(fortify(Seatbelts)) %>%
janitor::clean_names() %>%
mutate(dt_as_number = time(Seatbelts),
year = floor(dt_as_number),
month = (dt_as_number - floor(dt_as_number))*12 + 1,
people_killed = front + rear + drivers_killed) %>%
select(-dt_as_number, -drivers, -van_killed, -drivers_killed,
-front, -rear, -kms)
print(t)
# DO NOT EDIT THIS CODE
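The setup code derives year and month from the decimal values of time(Seatbelts), so month comes out as a floating-point number (which is why it is rounded later in Q2). As a minimal sketch, and not part of the assignment code, the same calendar fields can be read as integers from the time-series attributes. The names month_int and year_int are illustrative only.

```r
# Integer calendar fields taken from the ts attributes instead of time().
data("Seatbelts")

# cycle() gives the position within the year: 1 = January, ..., 12 = December.
month_int <- as.integer(cycle(Seatbelts))

# start(Seatbelts)[1] is the first year; advance one year every 12 rows.
year_int <- start(Seatbelts)[1] + (seq_len(nrow(Seatbelts)) - 1) %/% 12

print(table(month_int))  # each month appears once per year
print(range(year_int))
```

Working from cycle() avoids any rounding step, at the cost of departing from the provided template.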
Q1: Summarize variables (30%)
Create a graph of each variable in the t tibble.
Then, summarize the data by year and print it out as a table.
Finally, create a correlation matrix chart.
Create a bullet point list describing each variable, and the 2-3 most
important relationships.
Add your analysis here
The two most important variables for decreasing vehicle accident
deaths appear to be year and the passage of the seatbelt law. While
traffic deaths drastically fell after the passage of the seatbelt law,
they were already trending down over time.
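One way to check both halves of this claim is to compare the average monthly total with and without the law, and to look at yearly totals for the pre-existing trend. The sketch below uses base R only and rebuilds people_killed the same way the setup code does. The names sb, mean_by_law, and yearly_total are illustrative.

```r
data("Seatbelts")
sb <- as.data.frame(Seatbelts)
sb$year <- start(Seatbelts)[1] + (seq_len(nrow(sb)) - 1) %/% 12
sb$people_killed <- sb$front + sb$rear + sb$DriversKilled

# Mean monthly total with the law off (0) and on (1).
mean_by_law <- tapply(sb$people_killed, sb$law, mean)

# Yearly totals, to see the trend that predates the law.
yearly_total <- tapply(sb$people_killed, sb$year, sum)

print(round(mean_by_law, 1))
print(yearly_total)
```

A raw before/after difference in means does not separate the law's effect from the downward trend, which is why the regression in Q3 includes year as well.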
Bullet list for each variable
- petrol_price: this has a statistically significant negative
correlation with traffic deaths. The likely reason is that prices
generally increased over time while traffic deaths fell over time.
Higher prices may also have led to less driving.
- law: this has a statistically significant negative correlation with
traffic deaths. A causal relationship is plausible because seatbelts
are meant to protect passengers.
- year: this has a statistically significant negative correlation
with traffic deaths, likely due to improvements in car safety technology
over time. Part of the correlation is also likely attributable to the
seatbelt law being passed in the later years of the dataset.
- month: this has a statistically significant positive correlation
with traffic deaths. This is likely because later months such as November
and December include holidays that people generally travel for, such as
Christmas. Winter also starts around this time.
Bullet list for the top 2-3 key relationships
cor.test(t$month, t$people_killed)
Pearson's product-moment correlation
data: t$month and t$people_killed
t = 8.6903, df = 190, p-value = 1.666e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4237079 0.6275326
sample estimates:
cor
0.5333169
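The pairwise test above covers only month and people_killed. As a rough sketch of how all pairs could be checked at once, the block below builds a correlation matrix in base R. The data frame vars is an illustrative stand-in for the t tibble, with month and year taken as integers.

```r
data("Seatbelts")
sb <- as.data.frame(Seatbelts)

# Stand-in for the t tibble, using the same five variables.
vars <- data.frame(
  petrol_price  = sb$PetrolPrice,
  law           = sb$law,
  year          = start(Seatbelts)[1] + (seq_len(nrow(sb)) - 1) %/% 12,
  month         = as.integer(cycle(Seatbelts)),
  people_killed = sb$front + sb$rear + sb$DriversKilled
)

# Pearson correlations between every pair of variables.
cor_mat <- cor(vars)
print(round(cor_mat, 2))

# Significance test for one pair, as in the output above.
print(cor.test(vars$month, vars$people_killed))
```

The people_killed row of cor_mat is the one that backs the bullet list; cor.test() can be repeated for any pair whose significance matters.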
Q2: Time Vis (25%)
Create a time series graph of the number of people killed. Take some
time to make it as attractive and well-designed as you can.
Try to answer the question, “does the year and/or month have a major
impact on the number of people killed?”
- Traffic deaths clearly decrease as the years go by, likely due to
better safety technology and regulations.
- Traffic deaths also show some seasonality, with a general increase
from January to December, peaks in the summer months of July and
August, and another peak in December. Both of these peaks are likely due
to increased travel during these times.
library(lubridate)
# t2 is assumed to be the yearly summary from Q1 (its definition is not shown here)
t2 <- t %>%
  group_by(year) %>%
  summarize(total_killed = sum(people_killed))
# The law took effect on 31 Jan 1983, so 1983 is the first year shaded green
ggplot(data = t2) +
  geom_line(mapping = aes(x = year, y = total_killed)) +
  geom_vline(xintercept = 1983) +
  geom_rect(xmin = 1983, xmax = 1984, ymin = 0, ymax = 20000, fill = 'green', alpha = 0.02) +
  geom_rect(xmin = 1969, xmax = 1983, ymin = 0, ymax = 20000, fill = 'red', alpha = 0.02) +
  labs(title = "Total People Killed in GB Car Accidents by Year",
       subtitle = "Years in green are after seatbelts were made compulsory.",
       x = "Year",
       y = "Total Killed")

# Average people killed for each calendar month across all years
t3 <- t %>%
  select(month, people_killed) %>%
  mutate(month = round(month, 0)) %>%
  group_by(month) %>%
  summarize(average_killed = mean(people_killed))
ggplot(data = t3) +
  geom_col(aes(x = month, y = average_killed, fill = month)) +
  labs(title = "Average Traffic Deaths by Month",
       subtitle = "Years of data include 1969 to 1984",
       y = "Average Deaths",
       x = "Month")
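The yearly line and the monthly averages look at trend and seasonality separately. A classical additive decomposition is one way to estimate both from the same monthly series. This is a sketch using decompose() from base R's stats package, not the method used for the charts above. The names killed and seasonal_by_month are illustrative.

```r
data("Seatbelts")

# Monthly series matching people_killed in the setup code.
killed <- Seatbelts[, "front"] + Seatbelts[, "rear"] + Seatbelts[, "DriversKilled"]

# Additive decomposition: series = trend + seasonal + remainder.
parts <- decompose(killed)

# Estimated seasonal effect for each calendar month, relative to the trend.
seasonal_by_month <- setNames(round(parts$figure, 1), month.abb)
print(seasonal_by_month)
```

Calling plot(parts) draws the observed, trend, seasonal, and remainder panels, which can be compared against the two charts above.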

Q3: Linear Model (25%)
Create a linear regression model that predicts the number of people
killed. Explain the quality of your model, as well as the impact of each
variable on your results.
Explain the overall results, as well as each variable’s
impact. Place Answer here
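This model predicts the total traffic deaths in a given month from the
variables petrol_price, law, year, and month. All of these variables
are statistically significant within the model, with p-values far below
0.05.
The model has an overall R^2 of 0.66, meaning that it explains 66%
of the variance in monthly traffic deaths around their mean.
The summary printed below is for m1, which uses only year and month
and has a multiple R^2 of 0.62. In that fit, each additional year is
associated with about 32 fewer deaths per month, and each later calendar
month with about 39 more.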
summary(m1)
Call:
lm(formula = people_killed ~ year + month, data = t)
Residuals:
Min 1Q Median 3Q Max
-354.4 -107.6 -5.7 93.3 393.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63989.065 4838.959 13.22 <2e-16 ***
year -31.814 2.448 -12.99 <2e-16 ***
month 38.992 3.269 11.93 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 156.4 on 189 degrees of freedom
Multiple R-squared: 0.6221, Adjusted R-squared: 0.6181
F-statistic: 155.6 on 2 and 189 DF, p-value: < 2.2e-16
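The Call line in the summary shows the formula people_killed ~ year + month, while the write-up names four predictors. As a hedged sketch of how the two specifications could be fit and compared side by side, the block below uses base R. The names d, m_small, and m_full are illustrative, and month and year are integers here, so the figures may differ slightly from the summary above.

```r
data("Seatbelts")
sb <- as.data.frame(Seatbelts)

# Stand-in for the t tibble.
d <- data.frame(
  people_killed = sb$front + sb$rear + sb$DriversKilled,
  petrol_price  = sb$PetrolPrice,
  law           = sb$law,
  year          = start(Seatbelts)[1] + (seq_len(nrow(sb)) - 1) %/% 12,
  month         = as.integer(cycle(Seatbelts))
)

# Two-variable model, matching the Call shown in the summary.
m_small <- lm(people_killed ~ year + month, data = d)

# Four-variable model, matching the variables named in the write-up.
m_full <- lm(people_killed ~ petrol_price + law + year + month, data = d)

r2 <- c(year_month = summary(m_small)$r.squared,
        all_four   = summary(m_full)$r.squared)
print(round(r2, 4))

# F-test of whether petrol_price and law add explanatory power.
print(anova(m_small, m_full))
```

Because the models are nested, R^2 cannot fall when predictors are added, so the anova() F-test and the coefficient p-values from summary(m_full) are the better guide to whether the extra variables matter.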
Q4: Other Model (20%)
Use a different model to analyze the number of people killed. Explain
the quality of your model, as well as the impact of each variable in
your results.
Explain the overall results, as well as each variable’s
impact. Place Answer here
I used a decision tree model based on year, month, and petrol price.
The decision tree first groups data into before and after 1975.
On the left side, it then naturally creates a split for the seatbelt
law being passed by splitting on the year being 1983 or higher. After
that, it splits on month to reach a numeric prediction.
On the right side, for years 1975 and prior, it splits on month
being April or earlier before splitting on petrol price to reach its
final prediction.
Later months, earlier years, and lower petrol prices led to higher
predicted deaths. The model had an overall R^2 of 0.79, computed on the
same data it was trained on.
library(rpart)
library(rpart.plot)
# Fit a regression tree on year, month, and petrol price, then plot it
m_dt <- rpart(people_killed ~ year + month + petrol_price, data = t)
rpart.plot(m_dt)

# In-sample predictions and residuals
output <- predict(m_dt, t)
t4 <- t %>%
  mutate(predicted = output) %>%
  mutate(residual = predicted - people_killed)
# R^2 = 1 - RSS / TSS, computed on the training data
rss <- sum(t4$residual ^ 2)
total_variation <- t4$people_killed - mean(t4$people_killed)
total_sum_squares <- sum(total_variation ^ 2)
r2 <- 1 - (rss / total_sum_squares)
print(r2)
[1] 0.7941876
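The R^2 above is measured on the rows the tree was fit to, which tends to flatter a flexible model. A rough sketch of a hold-out check follows, assuming a random 75/25 split, which is a simplification for time-series data. The helper r_squared, the seed, and the names d, train, and test are illustrative and not part of the original analysis.

```r
library(rpart)
data("Seatbelts")
sb <- as.data.frame(Seatbelts)

# Stand-in for the t tibble.
d <- data.frame(
  people_killed = sb$front + sb$rear + sb$DriversKilled,
  petrol_price  = sb$PetrolPrice,
  year          = start(Seatbelts)[1] + (seq_len(nrow(sb)) - 1) %/% 12,
  month         = as.integer(cycle(Seatbelts))
)

# Same formula as the manual calculation above: 1 - RSS / TSS.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Hold out a random quarter of the months for evaluation.
set.seed(451)
test_rows <- sample(nrow(d), size = round(0.25 * nrow(d)))
train <- d[-test_rows, ]
test  <- d[test_rows, ]

fit <- rpart(people_killed ~ year + month + petrol_price, data = train)

r2_train <- r_squared(train$people_killed, predict(fit, newdata = train))
r2_test  <- r_squared(test$people_killed, predict(fit, newdata = test))
print(round(c(train = r2_train, test = r2_test), 3))
```

If the test value falls well below the training value, the tree is fitting noise in the training months. Scoring the linear model on the same split would make the two R^2 figures more comparable.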
