Note on data sets
You may use any data of your choosing in the following problems, but
I would suggest you choose a data set that you find interesting or that would give
an interesting graph (so, don’t use something like the old iris
data set). You will get more out of the project in the end, and it will
look better to anyone you show it to in the future. If the data
set comes from an R package, then reference the package. If the data set is from
elsewhere, then upload a copy to Blackboard (.csv format).
Problem 1 [20 points]
Create a plotly graph of your choosing that represents at least two
variables, one of which must be a categorical variable.
This plot can be a scatter plot, overlaid density plots (the graphing
variable is continuous, with separate densities grouped by the categorical
variable), etc. Choropleth maps could also be on the list…you have to
admit they look kinda cool.
The graph must include:
customized hover text that is informative to the graphing
elements in the plot
separate color to represent groups
labeled axes and appropriate title
# Import data
MLB <- read.csv("2023-2024_Data.csv")
Make a scatterplot of team ERA vs. team walks allowed to see whether
giving up more walks leads to a higher team ERA. This will display 60
points, one per team per year (30 teams in each of two years). The points
are colored to differentiate 2023 and 2024.
# Check structure of variables
str(MLB[, c(8, 21, 37)])
## 'data.frame': 60 obs. of 3 variables:
## $ ERA : num 4.62 3.49 3.94 4.04 3.78 4.67 4.09 3.61 5.47 3.61 ...
## $ BB : int 481 449 481 461 485 643 487 492 563 416 ...
## $ Year: int 2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
# Turn Year into a factor
MLB$Year <- as.factor(MLB$Year)
# Call packages (tidyverse already attaches dplyr)
library(plotly)
library(tidyverse)
# Make initial plot
MLB %>%
  plot_ly(x = ~ BB, y = ~ ERA,
          color = ~ Year,
          hoverinfo = "text",
          text = ~ paste("Team:", Tm, "<br>",
                         "Walks:", BB, "<br>",
                         "ERA:", ERA)) %>%
  add_markers(colors = c("red", "blue")) %>%
  layout(xaxis = list(title = "Walks Allowed"),
         yaxis = list(title = "Earned Run Average"),
         title = "Scatterplot of ERA vs Walks Allowed by MLB Teams in 2023 and 2024")
Include at least a 1-paragraph discussion about the graph. Discuss
what is being plotted and what information is being displayed in the
graph. Discuss any information that the reader may gain from hovering
the cursor over graphing elements. Discuss any issues/challenges you had
(if any) while making the plot, and how you dealt with or overcame
them.
The data for this graph came from the Baseball Reference website. I
found pitching data on all 30 MLB teams for the years 2023 and 2024. I
was interested to see whether the number of walks allowed by a team has an
effect on ERA, which is the average number of earned runs allowed per
nine innings. I was also curious to see which teams were better in the
walks and ERA categories and which teams weren’t. I think this gives a
good overview of how good a team’s pitching is. I think that walks
allowed is a better category to look at than hits allowed or home runs
allowed, since those also depend on the batter, whereas walks allowed
depend more on the pitcher than the batter in most cases. If a team faces
another team that is good offensively and gives up hits to the best
players in the league, I don’t think that is a knock on the pitching so
much as a compliment to the batters for being good. That is why I chose
walks allowed: it gives a better overview of the team’s pitching
performance. The biggest challenge I had making this graph was the
general data cleaning. The data came from two different files from
Baseball Reference, one for 2024 and one for 2023, so in Excel I had to
remove unnecessary rows, merge the data into one file, and add a
column for the year so that I could convert it to a factor to group
the data.
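The merge described above was done in Excel, but the same steps can be sketched in R. This is only a minimal sketch under stated assumptions: the two small data frames below are made-up stand-ins for the two Baseball Reference exports (the team labels and values are placeholders, not real data), and in practice each would come from read.csv() on the downloaded file.

```r
# Made-up stand-ins for the two yearly exports (placeholder teams and values)
pitching_2023 <- data.frame(Tm = c("Team A", "Team B"),
                            ERA = c(4.10, 3.90),
                            BB = c(500L, 470L))
pitching_2024 <- data.frame(Tm = c("Team A", "Team B"),
                            ERA = c(4.30, 3.70),
                            BB = c(480L, 450L))

# Add a column recording which season each row came from
pitching_2023$Year <- 2023
pitching_2024$Year <- 2024

# Stack the two seasons into one data frame
MLB_combined <- rbind(pitching_2024, pitching_2023)

# Convert Year to a factor so it can be used as a grouping (color) variable
MLB_combined$Year <- as.factor(MLB_combined$Year)

str(MLB_combined)
```

Doing the merge in code rather than by hand keeps the cleaning steps in the .Rmd file, so they can be rerun if the source files change.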
This graph is a scatterplot of ERA vs. walks allowed for all 30 teams in
each of the 2023 and 2024 seasons. One thing the reader can gather from
this graph is whether pitching was, in general, better in 2023 or 2024,
judging by whether points of one color sit mostly above points of the
other color. The reader can also see the team ERA and team walks allowed
for each team. This was not very helpful until I added the hover text to
each point, which includes the team name, the team ERA, and the team
walks allowed. I did not include the year in the hover text to avoid
redundancy, since the color already shows it. The graph also shows which
teams had better pitching and which did not. Some interesting points to
highlight are some of the influential points. The team with the lowest
team ERA and walks allowed was the 2024 Seattle Mariners. This team had
the best pitching statistics in the league last year, but its batting
statistics were among the worst in the league, and because of this the
team had a disappointing season. The teams with the highest ERA were the
2023 and 2024 Colorado Rockies and the 2023 Oakland Athletics. The
Colorado Rockies always have one of the highest team ERAs, for two
reasons. First, they have been a very bad team recently. Second, their
stadium is high in the mountains: the ballpark has very large
dimensions, and in the thin air baseballs fly, so with the very large
outfield it is a hitter’s ballpark. Strikingly more runs are scored in
their stadium than average because of the thin air and the big field.
The Oakland Athletics were a very bad team in 2023, so it makes sense
that their team ERA and walks were up.
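The visual impression of an association between walks and ERA can be backed up with a number. The sketch below is illustrative only: it uses just the ten 2024 values visible in the str() output earlier, not the full 60-row data set, so the result should not be read as a finding for the whole league. The variable names are my own.

```r
# The ten ERA and BB values shown in the str() output above (2024 rows only)
era <- c(4.62, 3.49, 3.94, 4.04, 3.78, 4.67, 4.09, 3.61, 5.47, 3.61)
bb  <- c(481, 449, 481, 461, 485, 643, 487, 492, 563, 416)

# Pearson correlation between walks allowed and ERA
r <- cor(bb, era)

# Simple linear fit; the slope is the change in ERA per additional walk
fit <- lm(era ~ bb)

print(r)
print(coef(fit))
```

With the full data, the same calls could be run on MLB$BB and MLB$ERA, or separately within each year to compare the two seasons.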
Problem 2 [20 points]
Create an animated plotly graph with a data set of your choosing.
This can be, but does not have to be a scatter plot. Also, the animation
does not have to take place over time. As mentioned in the notes, the
frame can be set to a categorical variable. However, the categories the
frames cycle through should be ordered (if need be) such that the
progression through them shows some pattern.
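One way to organize the categories for a categorical frame is to reorder the factor levels by a summary statistic before plotting. The sketch below uses made-up stand-in data, and the names scores, group, and value are hypothetical. It relies on my understanding that plotly steps through the frames of a factor in the order of its levels, which is worth confirming against the plotly documentation.

```r
# Made-up stand-in data: a categorical frame variable and a numeric variable
scores <- data.frame(
  group = c("high", "low", "medium", "high", "low", "medium"),
  value = c(9, 1, 5, 11, 3, 7)
)

# Mean of the numeric variable within each category
group_means <- tapply(scores$value, scores$group, mean)

# Reorder the factor levels from the smallest mean to the largest
scores$group <- factor(scores$group, levels = names(sort(group_means)))

levels(scores$group)
```

The reordered factor could then be passed as frame = ~ group so that the animation progresses from the lowest to the highest category rather than alphabetically.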
This graph should include:
Aside from the graphing variable, a separate categorical
variable. For example, in our animated scatter plot we color grouped the
points by continent.
Appropriate axis labels and a title
Augment the frame label to make it more visible. This can include
changing the font size and color to make it stand out more, and/or
moving the frame label to a new location in the plotting region. Note,
if you do this, make sure it is still clearly visible and does not
obstruct the view of your plot.
Use the gapminder dataset from the dslabs package to make a
scatterplot showing the association between GDP and fertility rate.
# Load the package that contains the gapminder data set
library(dslabs)
# Keep only years before 2012 (GDP is missing for 2012 to 2016)
gap_filter <- gapminder %>%
  filter(year < 2012)
# General scatterplot (log-scale the y axis as well)
gap_filter %>%
  plot_ly(x = ~ fertility, y = ~ gdp,
          hoverinfo = "text",
          text = ~ paste("Country:", country, "<br>",
                         "Children:", fertility, "<br>",
                         "GDP:", gdp)) %>%
  add_markers(frame = ~ year,
              ids = ~ country,
              size = ~ gdp, color = ~ continent) %>%
  layout(xaxis = list(title = "Average Number of Children per Mother"),
         yaxis = list(title = "Gross Domestic Product", type = "log"),
         title = "GDP vs Fertility Rate for Countries Grouped by Continent") %>%
  add_text(x = 5.5, y = 3e+12, text = ~ year, frame = ~ year,
           textfont = list(size = 80, color = toRGB("gray"))) %>%
  animation_slider(currentvalue = list(font = list(color = "white")))
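The year cutoff used in the filter above can be found by counting the non-missing GDP values in each year. The sketch below is a minimal illustration on a made-up stand-in data frame (gap_demo, with placeholder values) that borrows the year and gdp column names from gapminder. With dslabs loaded, the same tapply() call could be pointed at gapminder$gdp and gapminder$year.

```r
# Made-up stand-in with the same column names as gapminder (placeholder values)
gap_demo <- data.frame(
  year = c(2010, 2010, 2011, 2011, 2012, 2012),
  gdp  = c(1.2e9, 3.4e10, 1.3e9, 3.5e10, NA, NA)
)

# Number of non-missing GDP values in each year
gdp_counts <- tapply(!is.na(gap_demo$gdp), gap_demo$year, sum)
print(gdp_counts)

# Last year that has at least one GDP value
last_year_with_gdp <- max(as.numeric(names(gdp_counts)[gdp_counts > 0]))

# Keep only the years up to that cutoff
gap_demo_filtered <- gap_demo[gap_demo$year <= last_year_with_gdp, ]
```

Deriving the cutoff from the data avoids hard-coding the year and makes the reason for the filter visible in the code.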
Include at least a 1-paragraph discussion about the plot. Discuss
what you are plotting and what trends can be seen throughout the
animation. Discuss any issues, if any, you ran into in making the plot
and how you overcame them.
This data comes from the gapminder dataset in the dslabs
package. I was very interested in seeing how fertility rate affects a
country’s GDP. I mentioned in the last assignment that countries with
lower life expectancies tend to have more children, and given which
countries those were, I concluded that this is likely because, with
lower life expectancies, families have more children for a
better chance that some of the children survive to keep the family
lineage going. Those countries, in general, tended to be countries whose
societal norms were to have more kids even if the family was poor or
from a poor country. With this graph I expected to see a negative
association between the GDP of a country and its fertility rate. I was
also curious how this varied between the continents of the world. Again,
as I mentioned before, in general the more built up a country is, the
more money it has and the weaker those societal norms are, unlike in the
poorer countries whose culture is centered on the family and preserving
the family heritage. Several features of this graph give a lot of
information. One is the points themselves, but what helps even more is
that they are colored by continent, so it is easy to see trends within a
continent and to compare the continents to one another. Next is the
hover text, which gives the name of the country represented by each
point along with its fertility rate and GDP. This makes it easy to see
which country is which. The points are also sized by GDP.
The last main feature is the animation. Pressing play steps through the
same scatterplot for each year from 1960 to 2011 in sequence, so the
points appear to move. The large gray label in the plot displays the
current year. The slider can also be moved to any particular
year.
Next, I will discuss some issues I had to fix while making the graph
and explain some general trends. The first problem was that I had to put
the y axis for GDP on a log scale. Since GDP varies so much between
countries, I had to rescale it in order for the graph to display a clear
picture. The next problem was that, when I made the initial plot, the
graph stopped moving after 2011 and the years 2012-2016 just flashed by
on the screen without the graph changing. I then investigated the data
and saw that there were no values for GDP for any observation from
2012-2016. I filtered the data accordingly, and then the animated plot
ran only from 1960 to 2011. The last problem was that I could not get
the text font to display correctly. The text printed, but the size and
color did not take effect. I noticed that when I followed the code from
the in-class example on how to display text, my code did not work then
either, even though it worked on your end. I investigated this some more
to make sure I had used the correct syntax in the function, and it all
matched up. I then searched the internet and, after not finding many
answers, read the help file for add_text(). After not finding much
there either, I eventually just experimented with the code some more. In
the videos the argument inside add_text() was written textFont, but when
I spelled it textfont it worked. I am not sure whether the argument was
renamed between versions or whether this was a mistake on my end, but
I was happy that I was able to figure that one out.
The general trend in the graph is that over the years the average
fertility rate goes down and there is not a big change in general GDP.
From the beginning years until about 1980, we can see two pretty
distinct groups, but as the years go on they mix together more, and in
the ending years there is a large group with low fertility rates and a
smaller group with high fertility rates. In the early years, we can see
that a lot of the points in the low group are pink, symbolizing
Europe. In the ending years, we can see that most of the group with
high fertility rates is darker green, symbolizing Africa. These
points are also lower on the GDP scale, which further supports my point
from before that poorer countries have more children. We saw this in the
last assignment with life expectancies, and it is seen here too, which I
think is interesting. The more industrialized a country is, the fewer
children families have on average, while the poorer countries tend to be
more on the traditional side of family heritage and have more kids
overall.
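The frame label fix described above comes down to the exact spelling of one argument. The sketch below isolates it, assuming the plotly package is installed. The labels data frame is a made-up stand-in with one row per frame, and the x and y positions are copied from the plot code earlier in this problem.

```r
library(plotly)

# Made-up stand-in data: one row per animation frame
labels <- data.frame(year = c(1960, 1961))

# textfont (all lower case) is the spelling that worked for the label;
# R names are case-sensitive, so textFont is a different name
p <- plot_ly(labels) %>%
  add_text(x = 5.5, y = 3e12, text = ~ year, frame = ~ year,
           textfont = list(size = 80, color = toRGB("gray")))

p
```

Testing the label on a tiny plot like this, separate from the full animation, makes it quicker to see whether a change to the font settings has taken effect.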
What to turn in:
Knit your final assignment to an HTML document and publish it to
an RPubs page.
Submit (1) the .Rmd file and (2) the link to this page in
Blackboard (this can be in a Word document or some other form to submit
the link).
Published RPUBS Page