The year is 2017. You are an analyst at the National Highway Traffic Safety Administration, a government agency dedicated to making roads safer. You have been tasked with assessing the safety implications of a Tesla “self-driving” feature, Autosteer (a lane-keeping assistance technology). Your assignment comes in the wake of a fatal crash involving a Tesla owner who was using the Autosteer function.

To this end, you have received a dataset from Tesla. The dataset covers 43,781 Tesla vehicles before and after Autosteer was enabled. For each period, it records the miles driven and an indicator for whether the vehicle had a crash in which the airbag deployed. This class exercise walks you through an analysis of the data. Your goal is to answer the following question: “Did Autosteer make Teslas safer?”

Did Autosteer make Teslas safer?

You’ll be using the tesla_autosteer_data.csv dataset from the course website. The following instructions walk you through the analysis. Work in small groups (no more than 5 people) and discuss each step with one another. The dataset has four variables: MilesDriven_before, MilesDriven_after, AirbagEvent_before, AirbagEvent_after.

Step 1: Downloading and opening the data

  1. Download the dataset and put it into your ECON 0210 folder under a sub-folder for class exercises. I’ll assume that your file tree has the following structure somewhere after the Home directory (e.g. /Users/username on a Mac, or something like C:\Users\username on Windows): \ECON 0210\Class exercises\Week 2.

  2. Open a new R Markdown file, save it to \ECON 0210\Class exercises\Week 2, close it, and re-open it from the Week 2 folder. (Test your understanding: Why are we doing this step?)

  3. Load the tidyverse package.

  4. In the same code chunk, load the data using data <- read_csv("tesla_autosteer_data.csv"). (Test your understanding: What is this command doing? How does it depend on step 2?)
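A bare file name such as "tesla_autosteer_data.csv" is looked up relative to R's working directory, which is what ties step 4 to step 2. The base-R sketch below illustrates the idea under stated assumptions: the folder and the file toy.csv are hypothetical stand-ins created in a temporary location, and base read.csv() is used in place of read_csv() only to keep the sketch dependency-free.

```r
# Hypothetical stand-ins for the Week 2 folder and the real dataset
week2 <- file.path(tempdir(), "Week 2")
dir.create(week2, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(week2, "toy.csv"), row.names = FALSE)

# Pretend R's working directory is the Week 2 folder
old_wd <- setwd(week2)
found_in_folder <- file.exists("toy.csv")  # relative path resolves against the working directory
toy <- read.csv("toy.csv")                 # so the bare file name is enough

# From a different working directory the same bare file name is not found
setwd(tempdir())
found_elsewhere <- file.exists("toy.csv")

setwd(old_wd)  # restore the original working directory
```

Running getwd() in your own session shows which folder R is currently using as its working directory.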

Step 2: Analyzing the data

  1. Use the summary() command to calculate some summary statistics for the data.
     1. What is the mean of AirbagEvent_before and AirbagEvent_after? Discuss as a group: what does this quantity mean?
     2. Discuss as a group: What do you see in the summaries of MilesDriven_before and MilesDriven_after? Does anything look strange?
  2. Now run the following code in a new chunk to plot the data and inspect it. Discuss as a group: What is each line of the code doing? What does the plot tell you?
ggplot(data) +
  geom_point(aes(x=MilesDriven_before, y=MilesDriven_after)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color="firebrick3") +
  labs(x = "Miles driven before Autosteer", y = "Miles driven after Autosteer")
## Warning: Removed 29179 rows containing missing values (geom_point).
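As a small aid for the questions above, here is a base-R sketch on a hypothetical five-vehicle data frame (the values are invented for illustration and are not from the Tesla dataset). It shows how the mean of a 0/1 indicator behaves, and how to count the rows that a scatter plot would have to drop because one of the two mileage values is missing.

```r
# Hypothetical toy data: 5 vehicles, values invented for illustration
toy <- data.frame(
  MilesDriven_before = c(1000, NA, 2500, NA, 400),
  MilesDriven_after  = c(NA, 3000, 2600, NA, 900),
  AirbagEvent_before = c(0, 0, 1, 0, 0)
)

# The mean of a 0/1 indicator is the share of rows equal to 1
share_with_event <- mean(toy$AirbagEvent_before)

# A point needs both an x and a y value, so a row with either one missing cannot be plotted
rows_unplottable <- sum(is.na(toy$MilesDriven_before) | is.na(toy$MilesDriven_after))
rows_plottable   <- sum(complete.cases(toy[, c("MilesDriven_before", "MilesDriven_after")]))
```

Applying the same counting idea to the real data is one way to check where the number in the warning comes from.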
  3. One of your colleagues did some calculations and found that the crash rate (number of airbag events per million miles driven) was higher before Autosteer was deployed (1.3 crashes per million miles) than after (0.8 crashes per million miles). Use the following code to try to replicate their calculations.
crash_data_summary <- data %>%
  summarize( # summarize() collapses the rows of each column into a single summary value
    mean_miles_before = mean(MilesDriven_before, na.rm = TRUE), # hint: try changing this to na.rm = FALSE. What just happened?
    mean_miles_after = mean(MilesDriven_after, na.rm = TRUE),
    mean_airbags_before = mean(AirbagEvent_before),
    mean_airbags_after = mean(AirbagEvent_after)
  ) %>%
  mutate( # mutate() generates new variables (columns)
    crash_rate_before = 1000000 * mean_airbags_before / mean_miles_before, # multiplying by 1,000,000 converts the units from "crashes per mile" to "crashes per million miles"
    crash_rate_after = 1000000 * mean_airbags_after / mean_miles_after
  )

crash_data_summary %>% select(crash_rate_before, crash_rate_after) # select() keeps only the listed variables for display (or for the next step, if another %>% followed)
     1. Discuss as a group: What is the na.rm option doing? Why does it matter?
     2. Test your understanding: What is each line of code doing?
     3. Discuss as a group: What numbers did you get? Do they match what your colleague found? Why or why not?
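For the na.rm question, a minimal base-R sketch on a hypothetical three-element vector (not the course data) may help: it contrasts the default behavior of mean() with na.rm = TRUE.

```r
# Hypothetical mileage values with one missing entry
miles <- c(1000, NA, 3000)

mean_keep_na <- mean(miles)                # any NA in the input makes the result NA
mean_drop_na <- mean(miles, na.rm = TRUE)  # NA is dropped first: (1000 + 3000) / 2
```

Note that na.rm = TRUE changes how many observations the mean is taken over, which is worth keeping in mind when a mean is later used as a denominator.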
  4. Your colleague suggests you run the following code to reproduce their analysis:
data_edited <- data %>% 
  replace_na(list(MilesDriven_before = 0, MilesDriven_after = 0)) # The `replace_na` function replaces NA values. Here, the NA values are being replaced with 0.

crash_data_summary <- data_edited %>% 
  summarize(
    mean_miles_before = mean(MilesDriven_before),
    mean_miles_after = mean(MilesDriven_after),
    mean_airbags_before = mean(AirbagEvent_before),
    mean_airbags_after = mean(AirbagEvent_after)
  ) %>%
  mutate(
    crash_rate_before = 1000000*mean_airbags_before/(mean_miles_before),
    crash_rate_after = 1000000*mean_airbags_after/(mean_miles_after)
  )

crash_data_summary %>% select(crash_rate_before, crash_rate_after)
     1. What numbers did you get? Do they match what your colleague found?
     2. Discuss as a group: Which set of numbers do you think is “more correct” and why?
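To see mechanically why the two approaches can disagree, here is a base-R sketch on a hypothetical four-vehicle example (invented values, not the Tesla data). The assignment through is.na() stands in for replace_na(), and the rate formula mirrors the one used in the code above. The sketch only shows how the denominator moves; it does not settle which treatment of missing mileage is appropriate.

```r
# Hypothetical toy data: 4 vehicles, 2 with missing mileage, 1 airbag event
miles  <- c(1000, NA, 3000, NA)
events <- c(0, 0, 1, 0)

# Approach 1: drop missing mileage when averaging (na.rm = TRUE)
mean_miles_dropped <- mean(miles, na.rm = TRUE)  # averaged over 2 vehicles

# Approach 2: treat missing mileage as 0 (what replacing NA with 0 does)
miles_zero_filled <- miles
miles_zero_filled[is.na(miles_zero_filled)] <- 0
mean_miles_zero_filled <- mean(miles_zero_filled)  # averaged over all 4 vehicles

# Same numerator in both cases: the mean of the event indicator over all 4 vehicles
mean_events <- mean(events)

rate_dropped     <- 1000000 * mean_events / mean_miles_dropped
rate_zero_filled <- 1000000 * mean_events / mean_miles_zero_filled
```

In this toy case the zero-filled average mileage is half the NA-dropped one, so the computed rate doubles even though the events are unchanged.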