Note: on data sets

You may use any data of your choosing in the following problems, but I would suggest you choose a data set that you find interesting or that would give an interesting graph (so, don’t use something like the old iris data set). You will get more out of the project in the end, and it will look better to anyone you show it to in the future. If the data set comes from an R package, then reference the package. If the data set is from elsewhere, then upload a copy to Blackboard (.csv format).

Problem 1 [20 points]

Create a plotly graph of your choosing that represents at least two variables, one of which must be a categorical variable.

This plot can be a scatter plot, overlaid density plots (graphing variable is continuous, separate densities grouped by categorical variable), etc. Choropleth maps could also be on the list…you have to admit they look kinda cool.

The graph must include:

  1. customized hover text that is informative to the graphing elements in the plot

  2. separate color to represent groups

  3. labeled axes and appropriate title

Include at least a 1-paragraph discussion about the graph. Discuss what is being plotted and what information is being displayed in the graph. Discuss any information that the reader may gain from hovering the cursor over graphing elements. Discuss any issues/challenges you had (if any) while making the plot, and how you dealt with or overcame them.

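# Assumption: Fatalities is the data set shipped with the AER package
library(dplyr)
library(plotly)
data("Fatalities", package = "AER")

# Keep one row per state: the most recent year on record
df <- Fatalities %>%
  mutate(year = as.integer(as.character(year))) %>%  # year may be stored as a factor
  group_by(state) %>%
  filter(year == max(year)) %>%
  ungroup() %>%
  mutate(state = toupper(state))  # "USA-states" expects upper-case state codes

p1 <- df %>%
  plot_geo(locationmode = "USA-states") %>%
  add_trace(
    type = "choropleth",
    z = ~fatal,
    locations = ~state,
    # fatal is the total count that drives the color; afatal is the
    # alcohol-involved count, so it gets its own label
    text = ~paste0(
      "State: ", state,
      "<br>Population: ", round(pop),
      "<br>Traffic Fatalities: ", fatal,
      "<br>Alcohol-Involved Fatalities: ", round(afatal, 1),
      "<br>Percent of Young Drivers (15-24): ", round(youngdrivers * 100, 1), "%"
    ),
    hoverinfo = "text",
    colorscale = "Reds",
    colorbar = list(title = "Fatalities")
  ) %>%
  layout(
    title = "Traffic Fatalities by State in 1988",
    geo = list(scope = "usa")
  )

p1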

This choropleth map displays traffic fatalities in 1988 for each U.S. state, excluding Alaska and Hawaii. The darker the red, the higher the fatality count in that state. Hovering over a state shows its population, its fatality figures, and the percent of its drivers aged 15-24. Interactivity makes comparisons easier and lets the user explore more of the underlying data without cluttering the visual. The colors indicate that California had the highest number of traffic fatalities in 1988; its hover text shows 1246.7 for the afatal variable, which counts alcohol-involved fatalities, and 14.9% of its drivers were young drivers. States with larger populations tend to have more traffic fatalities. However, I found no clear relationship between the percent of young drivers and the number of traffic fatalities. One challenge I experienced when creating this plot was making sure each state appeared only once. The original data contained multiple years for each state, so each state was listed multiple times. To remedy this, I grouped the data by state and kept only the most recent year before mapping. With that issue resolved, the choropleth matches the data and shows the geographic pattern more clearly than a typical scatter plot would.
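The one-row-per-state requirement described above can be checked before mapping. The sketch below is only an illustration: it assumes dplyr is installed and uses a small made-up data frame (the values are hypothetical, not taken from Fatalities). It keeps each state's most recent year and confirms that no state code repeats.

```r
library(dplyr)

# Hypothetical toy data: two states observed in two years each
toy <- data.frame(
  state = c("ca", "ca", "tx", "tx"),
  year  = c(1987, 1988, 1987, 1988),
  fatal = c(10, 12, 7, 8),
  stringsAsFactors = FALSE
)

# Keep only each state's most recent year, then upper-case the codes
latest <- toy %>%
  group_by(state) %>%
  filter(year == max(year)) %>%
  ungroup() %>%
  mutate(state = toupper(state))

# A choropleth needs exactly one row per location
stopifnot(!any(duplicated(latest$state)))
latest
```

If the check fails, the data still contain more than one row for some state, and the map would be built from repeated locations.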

Problem 2 [20 points]

Create an animated plotly graph with a data set of your choosing. This can be, but does not have to be, a scatter plot. Also, the animation does not have to take place over time. As mentioned in the notes, the frame can be set to a categorical variable. However, the categories the frames cycle through should be ordered (if need be) such that the progression through them shows some pattern or trend.

This graph should include:

  1. Aside from the graphing variable, a separate categorical variable. For example, in our animated scatter plot we color grouped the points by continent.

  2. Appropriate axis labels and a title

  3. Augment the frame label to make it more visible. This can include changing the font size and color to make it stand out more, and/or moving the frame label to a new location in the plotting region. Note, if you do this, make sure it is still clearly visible and does not obstruct the view of your plot.

Include at least a 1-paragraph discussion about the plot. Discuss what you are plotting and what trends can be seen throughout the animation. Discuss any issues, if any, you ran into in making the plot and how you overcame them.

# Treat year and state as categorical so they can drive the frames and colors
fatal <- Fatalities %>%
  mutate(
    year = as.factor(year),
    state = as.factor(state)
  )

# Base scatter plot: one marker per state, one frame per year
p <- fatal %>%
  plot_ly(
    x = ~miles,
    y = ~fatal,
    color = ~state,
    frame = ~year,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10)
  )

# Large year label in the upper-right corner, updated with each frame
p2 <- p %>%
  add_trace(
    x = max(fatal$miles, na.rm = TRUE) * 0.85,
    y = max(fatal$fatal, na.rm = TRUE) * 0.9,
    text = ~year,
    frame = ~year,
    type = "scatter",  # inherit = FALSE drops the type set in plot_ly(), so restate it
    mode = "text",
    textfont = list(size = 55, color = toRGB("darkgreen")),
    showlegend = FALSE,  # keep the label out of the state legend
    hoverinfo = "none",
    inherit = FALSE
  ) %>%
  animation_slider(hide = TRUE) %>%
  layout(
    title = "Fatal Traffic Accidents vs Miles Driven in the U.S.",
    xaxis = list(title = "Miles Driven per Driver"),
    yaxis = list(title = "Number of Fatalities")
  )

p2

This animated scatter plot visualizes the possible relationship between miles driven per driver and total traffic fatalities in each U.S. state (excluding Hawaii and Alaska) between 1982 and 1988. Because each state has its own color, the user can see how different states compare within each year and how their fatality patterns change over time. We can observe that states with higher miles driven often experience more fatalities, suggesting a possible relationship between the two in the years we are investigating. The year label was added directly inside the plot to make each frame easy to identify, and its large, dark green font ensures that it stands out without overwhelming the data points. A challenge in creating the plot was positioning the frame label so that it did not cover dense clusters of points. To address this, I placed the text in the upper-right corner, at 85% of the maximum miles value and 90% of the maximum fatality count in the data set.
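The label placement described above can be sketched apart from the plot itself. The example below uses base R only and made-up numbers (hypothetical, not from the data set). It computes the upper-right anchor from the column maxima and then builds one label row per frame. That last step is an alternative to the approach in the solution, not what the solution does; it is one way to avoid drawing the same text once per data point.

```r
# Hypothetical toy data standing in for the real columns
toy <- data.frame(
  year  = factor(c(1982, 1982, 1983, 1983)),
  miles = c(7000, 9000, 7500, 10000),
  fatal = c(400, 1200, 450, 1500)
)

# Anchor the label at 85% of the largest x value and 90% of the largest y value
label_x <- max(toy$miles, na.rm = TRUE) * 0.85
label_y <- max(toy$fatal, na.rm = TRUE) * 0.9

# One row per frame, so each frame shows a single year label
labels <- data.frame(
  year = levels(toy$year),
  x = label_x,
  y = label_y
)
labels
```

A data frame like `labels` could then be passed to `add_text()` through its `data` argument, with `x = ~x`, `y = ~y`, `text = ~year`, and `frame = ~year`.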

What to turn in: