The year is 2017. You are an analyst at the National Highway Traffic Safety Administration, a government agency dedicated to making roads safer. You have been tasked with assessing the safety implications of a Tesla “self-driving” feature, Autosteer (a lane-keeping assistance technology). Your assignment comes in the wake of a fatal crash involving a Tesla owner who was using the Autosteer function.
To this end, you have received a dataset from Tesla. It covers 43,781 Tesla vehicles before and after the Autosteer technology was enabled and, for each period, includes a variable indicating whether or not the vehicle had a crash that led to the airbag being deployed. This class exercise will walk you through some analysis of the data. Your goal is to answer the following question: “Did Autosteer make Teslas safer?”
You’ll be using the tesla_autosteer_data.csv dataset from the Course website. The following instructions will walk you through the analysis. Work in small groups (no more than 5), and discuss steps with each other. There are four variables in this dataset: MilesDriven_before, MilesDriven_after, AirbagEvent_before, AirbagEvent_after.
Download the dataset and put it into your ECON 0210 folder under a sub-folder for class exercises. I’ll assume that your file tree has the following structure somewhere after the Home directory (e.g. /Users/username on a Mac, something like C:\ on Windows): \ECON 0210\Class exercises\Week 2.
Open a new Rmarkdown file, save it to \ECON 0210\Class exercises\Week 2, close it and re-open it from the Week 2 folder. (Test your understanding: Why are we doing this step?)
Load the tidyverse package.
In the same code chunk, load the data using data <- read_csv("tesla_autosteer_data.csv"). (Test your understanding: What is this command doing? How does it depend on step 2?)
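Because read_csv() is given only a file name, R looks for the file relative to its current working directory. The lines below are a minimal sketch, using base R only, of one way to check where R is looking before you try to load the data. The variable names current_dir and data_file are just illustrative.

```r
# Print the folder R treats as "here"; relative file names are resolved against it.
current_dir <- getwd()
print(current_dir)

# Check whether the dataset is visible from that folder before trying to read it.
data_file <- "tesla_autosteer_data.csv"
if (file.exists(data_file)) {
  message("Found ", data_file, " in ", current_dir)
} else {
  message("Could not find ", data_file, " in ", current_dir)
}
```

If the second message appears, the dataset and your Rmarkdown file are probably not in the same folder, or R was not started from that folder.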
Use the summary() command to calculate some summary statistics for the data.
What is the mean of AirbagEvent_before and AirbagEvent_after? Discuss as a group: what does this quantity mean?
What does summary() report for MilesDriven_before and MilesDriven_after? Does anything look strange?
# Scatterplot of miles driven before vs. after Autosteer; the dashed line marks equal mileage in both periods
ggplot(data) +
  geom_point(aes(x = MilesDriven_before, y = MilesDriven_after)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "firebrick3") +
  labs(x = "Miles driven before Autosteer", y = "Miles driven after Autosteer")
## Warning: Removed 29179 rows containing missing values (geom_point).
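To see what summary() and the plotting warning are reacting to, it can help to reproduce the same behavior on a tiny data frame. The sketch below uses made-up values (not taken from the Tesla dataset) and base R only. The data frame toy and its numbers are assumptions for illustration.

```r
# Toy data with made-up values; NOT the Tesla dataset.
toy <- data.frame(
  AirbagEvent_before = c(0, 1, 0, 0),
  MilesDriven_before = c(1200, NA, 800, NA),
  MilesDriven_after  = c(3000, 2500, NA, NA)
)

# summary() reports the mean of each column and counts NA values separately.
print(summary(toy))

# The mean of a 0/1 indicator is the share of rows equal to 1.
share_with_airbag_event <- mean(toy$AirbagEvent_before)
print(share_with_airbag_event)

# A scatterplot needs both coordinates, so a row with NA in either
# mileage column cannot be drawn as a point.
mileage_columns <- toy[, c("MilesDriven_before", "MilesDriven_after")]
rows_plottable <- sum(complete.cases(mileage_columns))
rows_dropped <- nrow(toy) - rows_plottable
print(rows_dropped)
```

In this toy example only one of the four rows has both mileage values, so three rows would be left out of a scatterplot.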
crash_data_summary <- data %>%
  summarize( # summarize() applies a function to several rows within a column, reducing the number of rows
    mean_miles_before = mean(MilesDriven_before, na.rm = TRUE), # hint: try changing this to na.rm=FALSE. What just happened?
    mean_miles_after = mean(MilesDriven_after, na.rm = TRUE),
    mean_airbags_before = mean(AirbagEvent_before),
    mean_airbags_after = mean(AirbagEvent_after)
  ) %>%
  mutate( # mutate() generates new variables (columns)
    crash_rate_before = 1000000 * mean_airbags_before / mean_miles_before, # multiplying by 1,000,000 converts the units from "crashes per mile" to "crashes per million miles"
    crash_rate_after = 1000000 * mean_airbags_after / mean_miles_after
  )
crash_data_summary %>% select(crash_rate_before, crash_rate_after) # select() keeps only the listed variables for display (or for the next step, if another %>% followed)
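The na.rm argument passed to mean() above can be tried out on a small vector first. This is a minimal sketch with made-up numbers. The vector miles is hypothetical and stands in for a mileage column with a missing entry.

```r
# Made-up mileage values with one missing entry; NOT from the Tesla dataset.
miles <- c(1000, NA, 3000)

# By default, any NA in the input makes the mean NA.
mean_default <- mean(miles)
print(mean_default)

# With na.rm = TRUE, NA entries are dropped before averaging,
# so the mean is taken over the remaining values only.
mean_dropped <- mean(miles, na.rm = TRUE)
print(mean_dropped)

# Number of values that actually entered the second average.
n_used <- sum(!is.na(miles))
print(n_used)
```

Note that the second average is over two values, not three. It is worth keeping track of which vehicles enter each average when averages are later divided by one another.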
What is the na.rm option doing? Why does it matter?
data_edited <- data %>%
  replace_na(list(MilesDriven_before = 0, MilesDriven_after = 0)) # The `replace_na` function replaces NA values. Here, the NA values are being replaced with 0.
crash_data_summary <- data_edited %>%
  summarize(
    mean_miles_before = mean(MilesDriven_before),
    mean_miles_after = mean(MilesDriven_after),
    mean_airbags_before = mean(AirbagEvent_before),
    mean_airbags_after = mean(AirbagEvent_after)
  ) %>%
  mutate(
    crash_rate_before = 1000000 * mean_airbags_before / mean_miles_before,
    crash_rate_after = 1000000 * mean_airbags_after / mean_miles_after
  )
crash_data_summary %>% select(crash_rate_before, crash_rate_after)
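The two ways of handling missing mileage used above (dropping NA values with na.rm = TRUE versus replacing them with 0) can give different crash rates. The sketch below compares them on a toy tibble with made-up values. The names toy, rate_na_dropped and rate_na_as_zero are illustrative only, and the numbers say nothing about the real Tesla data. It loads dplyr and tidyr directly, both of which are also attached by library(tidyverse).

```r
library(dplyr)
library(tidyr)

# Toy data with made-up values; NOT the Tesla dataset.
toy <- tibble(
  MilesDriven_before = c(1000, NA, 3000, NA),
  AirbagEvent_before = c(0, 1, 0, 0)
)

# Approach 1: drop NA mileage when averaging miles,
# but average the airbag indicator over all vehicles.
rate_na_dropped <- toy %>%
  summarize(
    mean_miles = mean(MilesDriven_before, na.rm = TRUE),
    mean_airbags = mean(AirbagEvent_before)
  ) %>%
  mutate(crash_rate = 1000000 * mean_airbags / mean_miles) %>%
  pull(crash_rate)

# Approach 2: treat NA mileage as 0 miles, then average over all vehicles.
rate_na_as_zero <- toy %>%
  replace_na(list(MilesDriven_before = 0)) %>%
  summarize(
    mean_miles = mean(MilesDriven_before),
    mean_airbags = mean(AirbagEvent_before)
  ) %>%
  mutate(crash_rate = 1000000 * mean_airbags / mean_miles) %>%
  pull(crash_rate)

print(c(na_dropped = rate_na_dropped, na_as_zero = rate_na_as_zero))
```

In this toy example the two approaches differ by a factor of two. Which treatment is appropriate depends on what a missing mileage value actually means for a vehicle, which is something to discuss in your group.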