Exam 4, Fall 2024

ACCT 426 / BUDA 451

Your name: Clayton Hammond

Before submitting, publish your results to RPubs.

RPubs URL: https://rpubs.com/ClaycanTCode/1258170

Overview

Use the Seatbelts dataset included with R.

Seatbelts is a time series giving the monthly totals of car drivers in Great Britain killed or seriously injured Jan 1969 to Dec 1984. Compulsory wearing of seat belts was introduced on 31 Jan 1983.

See a description of the dataset below:

  • DriversKilled: car drivers killed.
  • front: front-seat passengers killed or seriously injured.
  • rear: rear-seat passengers killed or seriously injured.
  • kms: distance driven.
  • PetrolPrice: petrol price.
  • VanKilled: number of van (‘light goods vehicle’) drivers killed.
  • law: was the law in effect that month? 1 or 0
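
The variable list above can be checked against the raw dataset before any cleaning. This is a minimal sketch using only base R; it inspects the original column names (before janitor::clean_names() renames them) and the time span of the series.

```r
# Load the built-in dataset (a multivariate monthly time series)
data("Seatbelts")

# Original column names, before any renaming
print(colnames(Seatbelts))

# Time span and sampling frequency of the series
print(start(Seatbelts))      # first (year, month)
print(end(Seatbelts))        # last (year, month)
print(frequency(Seatbelts))  # observations per year

# Number of months with and without the seat belt law in effect
print(table(Seatbelts[, "law"]))
```
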
# DO NOT EDIT THIS CODE
library(tidyverse)
options(scipen = 10)

data('Seatbelts')
t <- as_tibble(fortify(Seatbelts)) %>% 
  janitor::clean_names() %>% 
  mutate(dt_as_number = time(Seatbelts),
         year = floor(dt_as_number),
         month = (dt_as_number - floor(dt_as_number))*12 + 1,
         people_killed = front + rear + drivers_killed) %>% 
  select(-dt_as_number, -drivers, -van_killed, -drivers_killed,
         -front, -rear, -kms)
  

print(t)
# DO NOT EDIT THIS CODE

Q1: Summarize variables (30%)

Create a graph of each variable in the t tibble.

Then, summarize the data by year and print it out as a table.

Finally, create a correlation matrix chart.

Create a bullet point list describing each variable, and the 2-3 most important relationships.
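
One possible way to produce the yearly summary and the correlation matrix is sketched here with base R. It rebuilds the same columns as t directly from Seatbelts so that it runs on its own; the names sb, yearly, and cor_matrix are illustrative assumptions and are not part of the exam code.

```r
data("Seatbelts")

# Rebuild the analysis columns from the raw series (assumed equivalent to t)
sb <- data.frame(
  year          = rep(1969:1984, each = 12),       # series runs Jan 1969 to Dec 1984
  month         = as.integer(cycle(Seatbelts)),    # 1..12, exact integers
  petrol_price  = as.numeric(Seatbelts[, "PetrolPrice"]),
  law           = as.numeric(Seatbelts[, "law"]),
  people_killed = as.numeric(Seatbelts[, "front"] + Seatbelts[, "rear"] +
                             Seatbelts[, "DriversKilled"])
)

# Yearly averages; the mean of law is the share of months with the law in effect
yearly <- aggregate(cbind(people_killed, petrol_price, law) ~ year,
                    data = sb, FUN = mean)
print(yearly)

# Pairwise Pearson correlations between all columns
cor_matrix <- round(cor(sb), 2)
print(cor_matrix)
```

The correlation matrix is printed as a table here; a heatmap of cor_matrix would satisfy the "chart" part of the prompt.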

Add your analysis here

Q2: Time Vis (25%)

Create a time series graph of the number of people killed. Take some time to make it as attractive and well-designed as you can.

Try to answer the question, “does the year and/or month have a major impact on the number of people killed?”

Traffic deaths clearly decrease over the years, likely due to better safety technology and regulations.

Traffic deaths also show some seasonality: they generally rise from January to December, with peaks in the summer months of July and August and another peak in December. Both peaks are likely due to increased travel at those times.

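library(lubridate)

# Assumption: t2 is the yearly summary from Q1; it is rebuilt here so the chart runs on its own
t2 <- t %>% 
  group_by(year) %>% 
  summarize(total_killed = sum(people_killed))

ggplot(data = t2) +
  # Shade the years before (red) and after (green) the law of 31 Jan 1983.
  # annotate() draws each band once instead of once per row of t2.
  annotate("rect", xmin = 1969, xmax = 1983, ymin = 0, ymax = 20000, fill = 'red', alpha = 0.25) +
  annotate("rect", xmin = 1983, xmax = 1984, ymin = 0, ymax = 20000, fill = 'green', alpha = 0.25) +
  geom_vline(xintercept = 1983) +
  geom_line(mapping = aes(x = year, y = total_killed)) +
  labs(title = "Total People Killed in GB Car Accidents by Year",
       subtitle = "The green band marks the period after seat belts became compulsory (31 Jan 1983).",
       x = "Year",
       y = "Total Killed")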


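# Average deaths per calendar month across all years
t3 <- t %>% 
  select(month, people_killed) %>% 
  # month is derived from a decimal date, so round away floating-point error
  mutate(month = round(month, 0)) %>% 
  group_by(month) %>% 
  summarize(average_killed = mean(people_killed))

ggplot(data = t3) +
  geom_col(aes(x = month, y = average_killed, fill = month)) +
  # Label every month on the x axis
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Average Traffic Deaths by Month",
       subtitle = "Data covers 1969 to 1984",
       y = "Average Deaths",
       x = "Month")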
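
The prompt asks for a time series graph, so a month-by-month view may complement the yearly and per-month charts. The following is a rough sketch, assuming ggplot2 is installed; the data frame monthly and the plot object p are illustrative names, and the series is rebuilt from Seatbelts rather than taken from t.

```r
library(ggplot2)

data("Seatbelts")

# One row per month, with a proper Date column for the x axis
monthly <- data.frame(
  date = seq(as.Date("1969-01-01"), by = "month", length.out = nrow(Seatbelts)),
  people_killed = as.numeric(Seatbelts[, "front"] + Seatbelts[, "rear"] +
                             Seatbelts[, "DriversKilled"])
)

# Monthly line chart with a dashed marker at the date the law took effect
p <- ggplot(monthly, aes(x = date, y = people_killed)) +
  geom_line() +
  geom_vline(xintercept = as.Date("1983-01-31"), linetype = "dashed") +
  labs(title = "Monthly Traffic Deaths in Great Britain",
       subtitle = "Dashed line: seat belts made compulsory on 31 Jan 1983",
       x = "Date",
       y = "People Killed")

print(p)
```

Plotting every month keeps both the long-run decline and the within-year pattern visible in one chart.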


Q3: Linear Model (25%)

Create a linear regression model that predicts the number of people killed. Explain the quality of your model, as well as the impact of each variable on your results.

Explain the overall results, as well as each variable’s impact.

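This model predicts the total traffic deaths in a given month from two variables: year and month. Both are statistically significant, with p-values far below 0.05 (< 2e-16). Holding month constant, each additional year is associated with about 32 fewer deaths per month; holding year constant, each later calendar month is associated with about 39 more.

The model has an overall R² of 0.62 (adjusted R² of 0.62), meaning that it explains about 62% of the variance in monthly traffic deaths. The residual standard error of 156.4 indicates the typical size of a prediction error.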

m1 <- lm(people_killed ~ year + month, data = t)
summary(m1)

Call:
lm(formula = people_killed ~ year + month, data = t)

Residuals:
   Min     1Q Median     3Q    Max 
-354.4 -107.6   -5.7   93.3  393.9 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 63989.065   4838.959   13.22   <2e-16 ***
year          -31.814      2.448  -12.99   <2e-16 ***
month          38.992      3.269   11.93   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 156.4 on 189 degrees of freedom
Multiple R-squared:  0.6221,    Adjusted R-squared:  0.6181 
F-statistic: 155.6 on 2 and 189 DF,  p-value: < 2.2e-16
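
To make the coefficients concrete, the fitted equation can be evaluated by hand. This is a small sketch that copies the rounded estimates from the printed output above; predict_killed is a hypothetical helper, and because the inputs are rounded the results are approximate.

```r
# Coefficients copied from the printed summary(m1) output (rounded values)
intercept <- 63989.065
b_year    <- -31.814
b_month   <- 38.992

# Hypothetical helper: fitted value for a given year and month number (1-12)
predict_killed <- function(year, month) {
  intercept + b_year * year + b_month * month
}

# Fitted value for July 1980
july_1980 <- predict_killed(1980, 7)
print(round(july_1980))  # about 1270

# Implied gap between December and January of the same year
dec_minus_jan <- predict_killed(1980, 12) - predict_killed(1980, 1)
print(round(dec_minus_jan))  # about 429
```

Because month enters as a single numeric term, the model can only describe a steady rise from January to December; it cannot represent separate summer and December peaks.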

Q4: Other Model (20%)

Use a different model to analyze the number of people killed. Explain the quality of your model, as well as the impact of each variable in your results.

Explain the overall results, as well as each variable’s impact.

I used a decision tree model based on year, month, and petrol price. The decision tree first splits the data at 1975.

On the left side, it then effectively recovers the seat belt law by splitting on whether the year is 1983 or later. After that, it splits on month to reach a numeric prediction.

On the right side, for years 1975 and prior, it splits on whether the month is April or earlier before splitting on petrol price to reach its final prediction.

Later months, earlier years, and lower petrol prices led to higher predicted deaths. The model had an overall R² of 0.79, computed on the same data used to fit the tree.

library(rpart)
library(rpart.plot)

# Fit a regression tree on year, month, and petrol price
m_dt <- rpart(people_killed ~ year + month + petrol_price, data = t)

rpart.plot(m_dt)

# In-sample predictions: the tree is scored on the same data it was fit to
output <- predict(m_dt, t)

t4 <- t %>% 
  mutate(predicted = output) %>% 
  mutate(residual = predicted - people_killed)

# R^2 = 1 - (residual sum of squares / total sum of squares)
rss <- sum(t4$residual ^ 2)
total_variation <- t4$people_killed - mean(t4$people_killed)
total_sum_squares <- sum(total_variation ^ 2)
r2 <- 1 - (rss / total_sum_squares)

print(r2)
[1] 0.7941876
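
The R² above is measured on the data the tree was trained on, which can flatter a flexible model. One way to check this is a held-out split, sketched here under some assumptions: the data frame sb is rebuilt from Seatbelts to mirror t, r_squared is a hypothetical helper, and the 80/20 random split with a fixed seed is an arbitrary choice.

```r
library(rpart)

data("Seatbelts")

# Rebuild the modelling columns from the raw series (assumed equivalent to t)
sb <- data.frame(
  year          = rep(1969:1984, each = 12),
  month         = as.integer(cycle(Seatbelts)),
  petrol_price  = as.numeric(Seatbelts[, "PetrolPrice"]),
  people_killed = as.numeric(Seatbelts[, "front"] + Seatbelts[, "rear"] +
                             Seatbelts[, "DriversKilled"])
)

# Hypothetical helper: R^2 = 1 - RSS / TSS
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Random 80/20 split; the seed only makes the split reproducible
set.seed(1)
train_rows <- sample(nrow(sb), size = round(0.8 * nrow(sb)))
train <- sb[train_rows, ]
test  <- sb[-train_rows, ]

# Fit on the training rows only, then score both subsets
m_tree <- rpart(people_killed ~ year + month + petrol_price, data = train)
r2_train <- r_squared(train$people_killed, predict(m_tree, train))
r2_test  <- r_squared(test$people_killed,  predict(m_tree, test))

print(c(train = r2_train, test = r2_test))
```

If the test value is noticeably lower than the training value, the in-sample figure overstates how well the tree would predict unseen months.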
