Problem 1 [20 points]

Create a plotly graph of your choosing that represents at least two variables, one of which must be a categorical variable.

This plot can be a scatter plot, overlaid density plots (graphing variable is continuous, separate densities grouped by categorical variable), etc. Choropleth maps could also be on the list… you have to admit they look kinda cool.

You can use any data of your choosing; just reference the source.

The graph must include:

customized hover text that is informative to the graphing elements in the plot
separate color to represent groups
labeled axes and appropriate title

Include at least a 1-paragraph discussion about the graph. Discuss what is being plotted and what information is being displayed in the graph. Discuss any information that the reader may gain from hovering the cursor over graphing elements. Discuss any issues/challenges you had (if any) while making the plot, and how you dealt with or overcame them.

library(plotly)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(nlme)  # provides the Milk data set used in Problem 1
library(dslabs)

# Create a subset of the data for each diet type
barley <- subset(Milk, Diet == "barley")
barley_lupins <- subset(Milk, Diet == "barley+lupins")
lupins <- subset(Milk, Diet == "lupins")

# Calculate densities of protein for each diet type
d_barley <- density(barley$protein)
d_barley_lupins <- density(barley_lupins$protein)
d_lupins <- density(lupins$protein)

# Use empirical cumulative distribution function to estimate areas under the curve to add to the hover text
d_b_emp <- ecdf(barley$protein)
d_b_l_emp <- ecdf(barley_lupins$protein)
d_l_emp <- ecdf(lupins$protein)

# Create graph
# Shows density of protein by diet type
# Added hover text shows an estimate of the area under the curve

# The density grids are passed directly, so no data frame or formula (~) is needed.
# Hover values are rounded to keep the labels readable.
plot_ly() %>%
  add_lines(x = d_barley$x, y = d_barley$y, name = "Barley",
            hoverinfo = "text",
            text = paste0("Protein: ", round(d_barley$x, 3),
                          "<br>Area Under Curve Estimate: ", round(d_b_emp(d_barley$x), 3))) %>%
  add_lines(x = d_barley_lupins$x, y = d_barley_lupins$y, name = "Barley and Lupins",
            hoverinfo = "text",
            text = paste0("Protein: ", round(d_barley_lupins$x, 3),
                          "<br>Area Under Curve Estimate: ", round(d_b_l_emp(d_barley_lupins$x), 3))) %>%
  add_lines(x = d_lupins$x, y = d_lupins$y, name = "Lupins",
            hoverinfo = "text",
            text = paste0("Protein: ", round(d_lupins$x, 3),
                          "<br>Area Under Curve Estimate: ", round(d_l_emp(d_lupins$x), 3))) %>%
  layout(title = "Density of Milk Protein Content, by Cow Diet",
         xaxis = list(title = "Protein Content"),
         yaxis = list(title = "Density"))

This graph shows overlaid density plots of the protein content of cow’s milk in the weeks after calving. The data are grouped by diet, which can be barley, lupins, or barley and lupins. We can see that the distribution for the lupins diet is shifted farthest to the left overall, with the distribution for barley and lupins in the center and the distribution for barley shifted farthest to the right. The three distributions appear to have similar variance overall.

I added a hover text label with an estimate of the area under the curve using ecdf (the empirical cumulative distribution function) in base R. I did this because density values are not easily interpretable on their own, and I did not want to draw attention to those values and risk having a viewer misinterpret them. With an estimate of the area under the curve to the left of each point, the viewer can get a sense of the probability that milk falls below a certain protein content, broken down by the diet of the cow it came from. As a specific example, at a protein content of about 3.56, the density curves for the barley diet and the barley and lupins diet roughly intersect. However, the estimates of the cumulative distribution function show that the barley group has an estimated area under the curve of about 0.54, while the barley and lupins group has about 0.69. So milk from cows fed barley and lupins is more likely to have a protein content below about 3.56 units than milk from cows fed only barley.

An issue I had with this graph was figuring out how to add useful hover text while maintaining accuracy of the densities. I had to make sure that anything I added as hover text had the same length as the density output so that each point had a value. Because this is a density curve, it was not as easy to attach additional variable information to each point as it would be in a scatterplot.
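As a rough check on the hover text described above, the sketch below compares the ecdf values with a cumulative area computed directly from the kernel density. It assumes the Milk data from nlme and uses only the barley group; the names `hover_vals` and `auc_density` are introduced here for illustration and do not appear in the original code. The two estimates should be close but not identical, since ecdf is a step function built from the raw observations while the density is smoothed.

```r
library(nlme)  # provides the Milk data set

barley <- subset(Milk, Diet == "barley")

# density() evaluates the kernel density on a grid (512 points by default)
d_barley <- density(barley$protein)

# ecdf() returns a step function: F(x) is the share of observations <= x
F_barley <- ecdf(barley$protein)
hover_vals <- F_barley(d_barley$x)

# One hover value per plotted point, so the lengths must agree
length(hover_vals) == length(d_barley$x)

# Cross-check (illustrative): trapezoidal cumulative integral of the density
dx <- diff(d_barley$x)
mid_heights <- (head(d_barley$y, -1) + tail(d_barley$y, -1)) / 2
auc_density <- c(0, cumsum(dx * mid_heights))

# Largest disagreement between the two estimates across the grid
max(abs(auc_density - hover_vals))
```

Evaluating the ecdf on `d_barley$x` is what keeps the hover text aligned with the curve: each grid point of the density gets exactly one cumulative value.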

Problem 2 [20 points]

Create an animated plotly graph with a data set of your choosing. This can be, but does not have to be, a scatter plot. Also, the animation does not have to take place over time. As mentioned in the notes, the frame can be set to a categorical variable. However, the categories the frames cycle through should be organized (if need be) such that the progression through them shows some pattern.

This graph should include:

Aside from the graphing variable, a separate categorical variable. For example, in our animated scatter plot we color grouped the points by continent.
Appropriate axis labels and a title
Augment the frame label to make it more visible. This can include changing the font size and color to make it stand out more, and/or moving the frame label to a new location in the plotting region. Note, if you do this, make sure it is still clearly visible and does not obstruct the view of your plot.

Include at least a 1-paragraph discussion about the plot. Discuss what you are plotting and what trends can be seen throughout the animation. Discuss any issues, if any, you ran into in making the plot and how you overcame them.

library(modeldata)

# Animated histogram of trip distance, colored by locality, with one frame per day of the week
taxi %>%
  plot_ly(x = ~distance, color = ~local) %>%
  add_histogram(frame = ~dow, showlegend = TRUE, opacity = 0.95) %>%
  animation_opts(frame = 3000, transition = 1500) %>%
  layout(
    title = list(text = "Taxi Trips in Chicago in 2022: Distance and Locality of Trip",
                 font = list(color = "black", size = 17)),
    xaxis = list(title = "Distance (Miles)", color = "black",
                 tickfont = list(size = 18, color = "black"),
                 titlefont = list(size = 16, color = "black")),
    yaxis = list(title = "Frequency", color = "black",
                 tickfont = list(size = 18, color = "black"),
                 titlefont = list(size = 16, color = "black")),
    legend = list(title = list(text = "Was trip local?"))
  ) %>%
  # Enlarge and recolor the frame label so the current day stands out
  animation_slider(
    currentvalue = list(prefix = "Day of Week: ",
                        font = list(color = "blue", size = 20)),
    font = list(color = "black", size = 18)
  )

This taxi dataset from the modeldata package consists of 10,000 records of taxi trips in Chicago in 2022. The data give details of each trip, including the day of the week, the hour of the day, the month, the distance of the trip, whether or not the rider gave a tip, and whether or not the trip was local (started and ended in the same community). My plot cycles through the days of the week as frames and shows a histogram of trip distance in miles, colored by the additional categorical variable of whether or not the trip started and ended in the same community. This lets the viewer see whether trips tend to be longer or shorter depending on the day of the week, as well as how many trips are local or non-local. The viewer can also see whether more trips overall are local or non-local, and whether the proportion of local to non-local trips changes with the day of the week.

It is interesting to look at the trend across the frames: when the animation starts and ends, there seem to be the fewest trips overall compared to the days in between. This could suggest that many people take these Chicago taxis for work. We can also see across all frames that there are generally more non-local trips than local trips. Trips can quite possibly be non-local and still cover a short distance, while it is less common for a trip to start and end in the same community and cover a large distance. This suggests that a taxi trip can cross into another community quickly, even when the trip is not very long. One more point of interest is that the frequency of non-local trips with distances between about 0.5 and 2 miles is fairly consistent for most days of the week. There could be a systematic pattern in these trips that could be investigated further.

The most challenging part of making this plot was deciding which variable to use as the frame and how to display the values. I actually started with a boxplot broken down by tipping status, but I realized that it did not reveal much of a trend. Far fewer people did not tip than tipped, and I did not like how the boxplot showed differences in the distributions of distance by tipping status that could simply have been caused by there being fewer non-tippers in the dataset. I like having the data displayed in a histogram so that it shows frequencies.
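The claims above about trip counts and the local/non-local split by day could be checked numerically rather than read off the animation. The sketch below shows one way to do that in base R. It uses a small made-up data frame named `toy` (hypothetical, not the real taxi data) that only borrows the column names `dow` and `local`, so the numbers it produces say nothing about the actual data set.

```r
# Hypothetical toy data: same column names as taxi, invented values
toy <- data.frame(
  dow = factor(c("Sun", "Mon", "Mon", "Tue", "Tue", "Tue", "Sat"),
               levels = c("Sun", "Mon", "Tue", "Sat")),
  local = c("no", "yes", "no", "no", "no", "yes", "no")
)

# Trips per day and locality; rows follow the factor level order of dow
counts <- table(toy$dow, toy$local)

# Total trips per day, to compare the ends of the week with the middle
trips_per_day <- rowSums(counts)

# Share of local vs non-local trips within each day (each row sums to 1)
props <- prop.table(counts, margin = 1)

trips_per_day
round(props, 2)
```

With modeldata loaded, replacing `toy` with `taxi` in the two `table()` arguments should give the corresponding summary for the real data, assuming the columns are named as in the plotting code above.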

What to turn in:

Knit your final assignment to an HTML document and publish it to an RPubs page.
Submit (1) the Rmd file and (2) the link to this page in Blackboard (this can be in a Word document or some other form to submit the link).

Assignment 5 [40 points]

Olivia Schultheis
