Note: on data sets

You may use any data of your choosing in the following problems, but I would suggest you choose a data set you find interesting or one that would give an interesting graph (so, don’t use something like the old iris data set). You will get more out of the project in the end, and it will look better to those you show it to in the future. If the data set comes from an R package, then reference the package. If the data set is from elsewhere, then upload a copy to Blackboard (.csv format).


Problem 1 [20 points]

Create a plotly graph of your choosing that represents at least two variables, one of which must be a categorical variable.

This plot can be a scatter plot, overlaid density plots (graphing variable is continuous, separate densities grouped by categorical variable), etc. Choropleth maps could also be on the list…you have to admit they look kinda cool.

The graph must include:

  1. customized hover text that is informative to the graphing elements in the plot

  2. separate color to represent groups

  3. labeled axes and appropriate title

# Import data
MLB <- read.csv("2023-2024_Data.csv")

Make a scatterplot of team ERA vs team walks allowed to see whether giving up more walks goes with a higher team ERA. The plot displays 60 points, one per team per year (30 teams in each of two seasons), colored to differentiate 2023 from 2024.

# Check structure of variables
str(MLB[, c(8, 21, 37)])
## 'data.frame':    60 obs. of  3 variables:
##  $ ERA : num  4.62 3.49 3.94 4.04 3.78 4.67 4.09 3.61 5.47 3.61 ...
##  $ BB  : int  481 449 481 461 485 643 487 492 563 416 ...
##  $ Year: int  2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
# Turn Year into a factor
MLB$Year <- as.factor(MLB$Year)
# Call packages
library(plotly)
library(tidyverse)
library(dplyr)

# Make initial plot
MLB %>%
  plot_ly(x = ~ BB, y = ~ ERA,
          color = ~ Year,
          hoverinfo = "text",
          text = ~ paste("Team:", Tm, "<br>",
                         "Walks:", BB, "<br>",
                         "ERA:", ERA)) %>%
  add_markers(colors = c("red", "blue")) %>%
  layout(xaxis = list(title = "Walks Allowed"),
         yaxis = list(title = "Earned Run Average"),
         title = "Scatterplot of ERA vs Walks Allowed by MLB Teams in 2023 and 2024")

Include at least a 1-paragraph discussion about the graph. Discuss what is being plotted and what information is being displayed in the graph. Discuss any information that the reader may gain from hovering the cursor over graphing elements. Discuss any issues/challenges you had (if any) while making the plot, and how you dealt with or overcame them.

The data for this graph came from the Baseball Reference website. I found pitching data on all 30 MLB teams for 2023 and 2024. I was interested to see whether the number of walks a team allows has an effect on its ERA (earned run average), which is the number of earned runs allowed per nine innings. I was also curious to see which teams were better in the walks and ERA categories and which were not. I think this gives a good overview of how good a team’s pitching is. I think walks allowed is a better category to look at than hits allowed or home runs allowed, since those also depend on the batter, whereas walks allowed depend more on the pitcher in most cases. If a team faces a strong offense and gives up hits to the best players in the league, I don’t think that is a knock on the pitching so much as a compliment to the batters. That is why I chose walks allowed: it gives a better overview of the team’s pitching performance. The biggest challenge I had making this graph was the general data cleaning. The data came from two different Baseball Reference files, one for 2024 and one for 2023, so in Excel I had to remove unnecessary rows, merge the data into one file, and add a column for the year so I could convert it to a factor and group the data.
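The merge step described above could also be done in R instead of Excel. The following is only a minimal sketch, assuming both season files share the same columns. The file names in the comments and the two tiny stand-in data frames (team labels and 2023 values) are hypothetical placeholders, not the actual data.

```r
# Stand-ins for the two season files. In practice these would come from
# read.csv() on the two downloads (file names here are hypothetical):
#   pitch_2023 <- read.csv("pitching_2023.csv")
#   pitch_2024 <- read.csv("pitching_2024.csv")
pitch_2023 <- data.frame(Tm = c("Team A", "Team B"),
                         ERA = c(4.10, 3.90),
                         BB = c(500L, 470L))
pitch_2024 <- data.frame(Tm = c("Team A", "Team B"),
                         ERA = c(4.62, 3.49),
                         BB = c(481L, 449L))

# Tag each season before stacking so the year survives the merge
pitch_2023$Year <- 2023
pitch_2024$Year <- 2024

# Stack the rows (column names must match), then make Year a factor
MLB_combined <- rbind(pitch_2023, pitch_2024)
MLB_combined$Year <- as.factor(MLB_combined$Year)

str(MLB_combined)
```

Doing the merge in code keeps the cleaning steps reproducible, so the same script could be rerun if another season were added.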

This graph is a scatterplot of ERA vs walks allowed for all 30 teams in each of the 2023 and 2024 seasons. One thing the graph can show is whether pitching was generally better in 2023 or 2024, which would appear as points of one color sitting mostly above points of the other. The graph also shows each team’s ERA and walks allowed. This was not very helpful until I added hover text to each point, which includes the team name, team ERA, and team walks allowed. I did not include the year in the hover text to avoid redundancy, since the color already shows it. The graph also shows which teams had better pitching and which did not.

Some interesting points to highlight are some of the influential points. The team with the lowest team ERA and walks allowed was the 2024 Seattle Mariners. This team had the best pitching statistics in the league last year, but its batting statistics were among the worst in the league, and because of this they had a disappointing season. The teams with the highest ERA were the 2023 and 2024 Colorado Rockies and the 2023 Oakland Athletics. The Colorado Rockies have one of the highest team ERAs every year. For one, they have been a very bad team recently; for another, their stadium is high in the mountains. The ballpark has very large dimensions, and in the thin air baseballs fly, so with the very large outfield it is a hitter’s ballpark. Far more runs are scored in their stadium than average because of the thin air and the big field. The Oakland Athletics were a very bad team in 2023, so it makes sense that their team ERA and walks were up.
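The visual impression of the walks and ERA relationship could be backed up with a correlation coefficient. The sketch below uses only the first ten ERA and BB values visible in the str() output earlier (all from 2024), so it illustrates the calculation and is not a result for the full 60-row data set.

```r
# First ten ERA and BB values, copied from the str() output shown earlier
# (all 2024 rows); the full data set has 60 rows.
era <- c(4.62, 3.49, 3.94, 4.04, 3.78, 4.67, 4.09, 3.61, 5.47, 3.61)
bb  <- c(481, 449, 481, 461, 485, 643, 487, 492, 563, 416)

# Pearson correlation: values near +1 mean more walks go with a higher ERA
r <- cor(bb, era)
print(round(r, 2))

# On the full data, one correlation per season could be computed with:
#   sapply(split(MLB, MLB$Year), function(d) cor(d$BB, d$ERA))
```

A per-season correlation would also show whether the relationship looked similar in 2023 and 2024.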


Problem 2 [20 points]

Create an animated plotly graph with a data set of your choosing. This can be, but does not have to be, a scatter plot. Also, the animation does not have to take place over time. As mentioned in the notes, the frame can be set to a categorical variable. However, the categories the frames cycle through should be organized (if need be) such that the progression through them shows some pattern.

This graph should include:

  1. Aside from the graphing variable, a separate categorical variable. For example, in our animated scatter plot we color grouped the points by continent.

  2. Appropriate axis labels and a title

  3. Augment the frame label to make it more visible. This can include changing the font size and color to make it stand out more, and/or moving the frame label to a new location in the plotting region. Note, if you do this, make sure it is still clearly visible and does not obstruct the view of your plot.

Use the gapminder data set from the dslabs package to make an animated scatterplot showing the association between GDP and fertility rate.

# Import data set
library(dslabs)

# Filter out the 2012 to 2016 years
gapminder %>%
  filter(year < 2012) -> gap_filter


# Animated scatterplot (y axis on a log scale)
gap_filter %>%
  plot_ly(x = ~ fertility, y = ~ gdp,
          hoverinfo = "text",
          text = ~ paste("Country:", country, "<br>",
                         "Children:", fertility, "<br>",
                         "GDP:", gdp)) %>%
  add_markers(frame = ~ year,
              ids = ~ country,
              size = ~ gdp, color = ~ continent) %>%
  layout(xaxis = list(title = "Average Number of Children per Mother"),
         yaxis = list(title = "Gross Domestic Product", type = "log"),
         title = "GDP vs Fertility Rate for Countries Grouped by Continent") %>%
  add_text(x = 5.5, y = 3e+12, text = ~ year, frame = ~ year,
           textfont = list(size = 80, color = toRGB("gray"))) %>%
  animation_slider(currentvalue = list(font = list(color = "white")))

Include at least a 1-paragraph discussion about the plot. Discuss what you are plotting and what trends can be seen throughout the animation. Discuss any issues, if any, you ran into in making the plot and how you overcame them.

Next, I will discuss some issues I had to fix while making the graph and explain some general trends. The first problem was that I had to scale the y axis for GDP. Since the countries vary so much in this variable, I had to put it on a log scale for the graph to display a readable picture. The next problem was that in my initial plot the graph stopped moving after 2011, and the years 2012-2016 just flashed by on the screen without the graph changing. I investigated the data and saw that there were no GDP values for any observation from 2012-2016. I filtered the data accordingly, and then the animated plot ran from 1960 to 2011. The last problem was that I could not get the text font to display correctly. The text printed, but the size and color did not show up. When I followed the code from the in-class example on how to display text, my code did not work either, even though it worked on your end. I investigated this some more to make sure I had used the correct syntax in the function, and it all matched up. I then searched the internet without finding many answers, checked the help file for add_text() without finding much there either, and eventually just experimented with the code some more. In the videos the argument inside add_text() was written textFont, but when I spelled it textfont it worked. I am not exactly sure if the argument was renamed between versions or if this was a mistake on my end, but I was happy that I was able to figure that one out.

The general trend in the graph is that over the years the average fertility rate goes down and there is not a big change in general GDP. From the beginning years until about 1980, we can see two pretty distinct groups, but as the years go on they mix together more, and in the ending years there is a large group with low fertility rates and a smaller group with high fertility rates. In the early years, we can see that a lot of the points in the low group are pink, symbolizing Europe. In the ending years, most of the chunk with high fertility rates is darker green, symbolizing Africa. These points are also lower on the GDP scale, which further supports my earlier point that poorer countries tend to have more children. We saw this in the last assignment with life expectancies, and it is seen here too, which I think is interesting. The more industrialized a country is, the fewer children it has on average, while poorer countries tend to be more traditional about family and have more kids overall.
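The missing-GDP problem described above can be checked in code before filtering. This is a minimal base R sketch on a toy data frame whose values are made up for illustration. It stands in for the gapminder data from dslabs and only assumes the same year and gdp column names.

```r
# Toy stand-in for dslabs::gapminder (values are made up for illustration)
toy <- data.frame(
  country = rep(c("A", "B"), times = 3),
  year    = rep(c(2010, 2011, 2012), each = 2),
  gdp     = c(1e9, 2e9, 1.1e9, 2.1e9, NA, NA)
)

# For each year, is every gdp value missing?
all_missing <- tapply(is.na(toy$gdp), toy$year, all)
print(all_missing)

# Keep only the years that have at least one gdp value
good_years <- as.numeric(names(all_missing)[!all_missing])
toy_filtered <- toy[toy$year %in% good_years, ]
```

Replacing toy with gapminder should flag the years with no GDP values, which is one way to arrive at the year < 2012 filter used in the plot code without inspecting the data by hand.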

What to turn in:

  • knit your final assignment to an html document and publish it to an RPubs page.

  • submit (1) the rmd file and (2) the link to this page in Blackboard (this can be in a word document or some other form to submit the link).

Published RPUBS Page