Note: on data sets

You may use any data of your choosing in the following problems, but I would suggest you choose a data set that you find interesting or that would give an interesting graph (so, don’t use something like the old iris data set). You will get more out of the project in the end, and it will look better to anyone you show it to in the future. If the data set comes from an R package, then reference this. If the data set is from elsewhere, then upload a copy to Blackboard (.csv format).

# load in packages
library(readxl)
library(readr)
library(tidyverse)
library(plotly)


Problem 1 [20 points]

Create a plotly graph of your choosing that represents at least two variables, one of which must be a categorical variable.

This plot can be a scatter plot, overlaid density plots (graphing variable is continuous, separate densities grouped by categorical variable), etc. Choropleth maps could also be on the list…you have to admit they look kinda cool.

The graph must include:

  1. customized hover text that is informative to the graphing elements in the plot

  2. separate color to represent groups

  3. labeled axes and appropriate title

# load in data (read.csv already returns a data frame)
nba <- read.csv("nba_players_shooting.csv")

# filter the data to just russell westbrook
nba <- nba %>% 
  filter(SHOOTER == "Russell Westbrook")

# view structure 
# str(nba)

nba %>%
  plot_ly(x = ~X, y = ~Y, color = ~SCORE,
          hoverinfo = "text", # custom hover info
          text = ~paste("Defender:", DEFENDER, "<br>",
                        "Range:", RANGE, "<br>",
                        "x coordinate:", round(X,2), "<br>",
                        "y coordinate:", round(Y,2))) %>%
  add_markers(size = 2, 
              colors = c("darkgreen", "tomato")) %>%
  layout(xaxis = list(title = "x position"),
         yaxis = list(title = "y position"),
         title = "Location of Shots Taken by Russell Westbrook")

Include at least a 1-paragraph discussion about the graph. Discuss what is being plotted and what information is being displayed in the graph. Discuss any information that the reader may gain from hovering the cursor over graphing elements. Discuss any issues/challenges you had (if any) while making the plot, and how you dealt with or overcame them.

This plot shows the x and y coordinates of shots taken by NBA player Russell Westbrook and whether he made them. You can think of the plot as a picture of the court, with the hoop at the coordinate (0,0) and each point marking the spot he took a shot from. Green points represent shots he made, and red points represent shots he missed. Hovering over a point gives additional information about the shot: specifically, which player was guarding him when he took it, as well as the exact coordinates and the range. One challenge I ran into was that the original data set contained shot data for 4 different players. When these were all plotted together it was hard to tell what was going on, and there was no good way to differentiate each player's shots and whether they were made or missed. The plot looked confusing and busy, so I limited it to one player to get more information about his individual statistics. From this plot we can observe that Westbrook seems to have a better shooting percentage from the right side of the court, especially farther from the hoop, even though he seems to attempt a similar number of shots from each side of the court.
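
The observation about shooting percentage by side of the court could be checked numerically rather than by eye. The following is a minimal base R sketch, not the analysis used above: the `shots` data frame holds made-up toy values, the column names `X` and `SCORE` and the labels "MADE"/"MISSED" are assumed to match the real file, and which sign of `X` corresponds to the right side of the court is left open because it depends on the data set's coordinate convention.

```r
# Toy data for illustration only; the real values would come from
# nba_players_shooting.csv after filtering to one shooter.
shots <- data.frame(
  X     = c(-12.5, -3.0, 4.2, 15.8, 20.1, -18.4, 9.9, -7.7),
  SCORE = c("MISSED", "MADE", "MADE", "MADE",
            "MISSED", "MISSED", "MADE", "MADE")
)

# Split shots by the sign of X (assumes the hoop sits at X = 0).
shots$side <- ifelse(shots$X < 0, "X < 0", "X >= 0")

# Proportion of made shots and number of attempts on each side.
made_pct <- tapply(shots$SCORE == "MADE", shots$side, mean)
attempts <- table(shots$side)

print(made_pct)
print(attempts)
```

With the real data, comparing `made_pct` alongside `attempts` would show whether the difference between sides holds up given a similar number of attempts.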


Problem 2 [20 points]

Create an animated plotly graph with a data set of your choosing. This can be, but does not have to be, a scatter plot. Also, the animation does not have to take place over time. As mentioned in the notes, the frame can be set to a categorical variable. However, the categories the frames cycle through should be organized (if need be) such that the progression through them shows some pattern.

This graph should include:

  1. Aside from the graphing variable, a separate categorical variable. For example, in our animated scatter plot we color grouped the points by continent.

  2. Appropriate axis labels and a title

  3. Augment the frame label to make it more visible. This can include changing the font size and color to make it stand out more, and/or moving the frame label to a new location in the plotting region. Note, if you do this, make sure it is still clearly visible and does not obstruct the view of your plot.
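
When the frame is a categorical variable, one way to control the progression is to store it as a factor with explicitly ordered levels. The sketch below uses base R only and hypothetical month labels. It assumes that plotly cycles through frames in factor level order, which is worth confirming against the plotly documentation before relying on it.

```r
# Hypothetical frame labels; a plain character vector sorts alphabetically,
# which is usually not the intended progression.
month_chr <- c("May", "June", "July", "May", "July")
alphabetical <- sort(unique(month_chr))

# Setting the levels explicitly records the intended order.
month_fct <- factor(month_chr, levels = c("May", "June", "July"))
intended <- levels(month_fct)

print(alphabetical)  # "July" "June" "May"
print(intended)      # "May"  "June" "July"
```

Under that assumption, the factor column (here `month_fct`) would be the one passed to the `frame` argument.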

# load in covid data from (https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv)
df <- read.csv("covid_states_data.csv")

# load in regions data 
regions <- read.csv("states.csv")

# change date class 
df$date <- as.Date(df$date)

# filter to May through July 2020 (compare against Date objects directly)
df <- df %>%
  filter(date >= as.Date("2020-05-01"), date < as.Date("2020-08-01"))

# change date format for frame 
df$date <- as.character(df$date)

# change state column name in regions to merge data frames 
colnames(regions)[1] <- "state" 

# merge the data frames 
df2 <- left_join(df, regions, by = "state")
df2 <- na.omit(df2)

# view structure 
str(df2)
## 'data.frame':    4692 obs. of  9 variables:
##  $ X         : int  3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 ...
##  $ date      : chr  "2020-05-01" "2020-05-01" "2020-05-01" "2020-05-01" ...
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ fips      : int  1 2 4 5 6 8 9 10 11 12 ...
##  $ cases     : num  7294 362 7962 3310 52318 ...
##  $ deaths    : int  289 7 330 64 2147 818 2339 159 231 1313 ...
##  $ State.Code: chr  "AL" "AK" "AZ" "AR" ...
##  $ Region    : chr  "South" "West" "West" "South" ...
##  $ Division  : chr  "East South Central" "Pacific" "Mountain" "West South Central" ...
##  - attr(*, "na.action")= 'omit' Named int [1:368] 12 37 42 50 67 92 97 105 122 147 ...
##   ..- attr(*, "names")= chr [1:368] "12" "37" "42" "50" ...
df2 %>%
  plot_ly(x = ~cases, y = ~deaths, color = ~Region,
          hoverinfo = "text", # custom hover info
          text = ~paste("State:", state, "<br>",
                        "Region:", Region, "<br>",
                        "Division:", Division, "<br>",
                        "Cases:", cases, "<br>",
                        "Deaths:", deaths)) %>%
  add_markers(frame = ~date, # animate over date
              size = 2) %>%  # adjust point size 
  animation_slider(currentvalue = list(prefix = NULL, # adjustment for the current date on the slider
                                       font = list(color = toRGB("indianred3"),
                                                   size = 20))) %>%
  animation_opts(frame = 100, transition = 50) %>%
  layout(xaxis = list(title = "number of cases"),
         yaxis = list(title = "number of deaths"),
         title = list(text = "Covid Deaths vs Cases for 2020 by State",
                    font = list(size = 17, color = toRGB("indianred3")),
                    y = 0.95, x = 0.13)) # change position of title for readability

Include at least a 1-paragraph discussion about the plot. Discuss what you are plotting and what trends can be seen throughout the animation. Discuss any issues, if any, you ran into in making the plot and how you overcame them.

This plot shows the number of Covid deaths versus cases, animated over the days of May, June, and July 2020. Each point represents a state, and the points are colored according to the region in which the state is located. I chose to color the points by region because I suspected that cases would spread between regions due to their close proximity. Hovering over a point gives more descriptive information about it: the state name, region, division, case count, and death count. I did run into some challenges when making this plot. The first was that the original dataset covered every day over a 3-year period, which was so much data to animate over that the plot would not load. To fix this I chose to focus on the 3-month period over the summer. I chose this period because it was still at the beginning of the pandemic, when cases really started emerging, and I wanted to see which regions were heavily affected first. The second issue was representing the second categorical variable. I originally had each point colored by state, but that also took too long to render and was not very useful. Since the dataset did not have another variable to include, I looked up a dataset that included the region and division for each state and merged the two. This allowed me to represent similar location information in a more meaningful way. Now you are able to see which regions were affected first by Covid cases. From this plot we can see that at the beginning of the summer New York has by far the most cases and deaths, followed by New Jersey, which also has significantly more than the rest of the states. As the animation progresses we can see that the growth in cases speeds up over time. Cases in California, Florida, and Texas all grow very quickly, starting at around 50,000 in May and reaching about 500,000 by the end of July. Deaths do not increase as drastically as cases, with those 3 states starting out with roughly 1,000 deaths and ending with around 7,500 deaths. Besides California, the West region shows the least growth.
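
Because the merge is followed by `na.omit()`, any row whose state name has no match in the regions table is dropped silently. The `str()` output above records 368 omitted rows. The following base R sketch shows one way to see which names fail to match before they are removed. The two small data frames are made-up stand-ins for the real tables, and `merge(..., all.x = TRUE)` is used here as a base R substitute for `left_join()`.

```r
# Toy stand-ins for the covid and regions tables (values are illustrative).
covid <- data.frame(
  state = c("Alabama", "Alaska", "Guam", "Puerto Rico"),
  cases = c(10, 20, 30, 40)
)
regions <- data.frame(
  state  = c("Alabama", "Alaska"),
  Region = c("South", "West")
)

# Names present in the covid table but missing from the regions table.
unmatched <- setdiff(unique(covid$state), regions$state)

# Left join in base R; unmatched rows get NA for Region.
merged <- merge(covid, regions, by = "state", all.x = TRUE)
n_dropped <- sum(is.na(merged$Region))

print(unmatched)
print(n_dropped)
```

Running the same check on the real tables would confirm whether the omitted rows are places that have no region in the lookup table, rather than states lost to a spelling mismatch.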

What to turn in:

  • knit your final assignment to an html document and publish it to an RPubs page.

  • submit (1) the rmd file and (2) the link to this page in Blackboard (this can be in a word document or some other form to submit the link).