ggplot2 basics
During ANLY 512 we will be studying the theory and practice of
data visualization. We will be using R and the
packages within R to assemble data and construct many
different types of visualizations. We begin by studying some of the
theoretical aspects of visualization. To do that we must appreciate the
basic steps in the process of making a visualization.
The objective of this assignment is to complete and explain basic plots before moving on to more complicated ways to graph data.
A couple of tips: remember that your graphics may require preprocessing, so you may need to compute summaries or other calculations first. Include those steps in your work.
To ensure accuracy, pay close attention to axes and labels; you will be evaluated on the accuracy and expository quality of your graphics. Make sure your axis labels are easy to understand and consist of full words, with units where necessary.
Each question is worth 5 points.
To submit this homework, create the document in RStudio using the knitr package (a button is included in RStudio), then publish it to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see both the visualization and the code required to create it.
Using data from the nasaweather package, create a
scatterplot between wind and pressure, with color being used to
distinguish the type of storm.
Summary: The plot shows a negative relationship between a storm’s wind speed and its air pressure: the higher the wind speed, the lower the air pressure. Hurricanes also reach the highest wind speeds of all the storm types.
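library(ggplot2)
library(nasaweather)  # provides the storms data
# str(storms)
# head(storms)
# colnames(storms)
# Scatterplot of wind speed against air pressure, colored by storm type
ggplot(storms, aes(wind, pressure, color = type)) +
  geom_point() +
  labs(title = 'Wind Speed vs. Air Pressure by Storm Type',
       color = "Type",
       x = 'Wind Speed (knots)',
       y = 'Air Pressure (millibars)')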
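The negative relationship described in the summary can also be checked numerically. The sketch below is one possible way to do that with a Pearson correlation in base R; the helper name `wind_pressure_cor` and the toy rows are made up for illustration and are not part of the assignment or the real data.

```r
# Hypothetical helper: Pearson correlation between the wind and pressure
# columns, skipping rows where either value is missing.
wind_pressure_cor <- function(df) {
  cor(df$wind, df$pressure, use = "complete.obs")
}

# Made-up toy rows that only mimic the shape of the storms table.
toy_storms <- data.frame(
  wind     = c(30, 45, 60, 85, 110),
  pressure = c(1008, 1000, 990, 975, 950)
)

r <- wind_pressure_cor(toy_storms)
print(r)  # negative for this toy data

# With the real table loaded, the same call would be:
# wind_pressure_cor(nasaweather::storms)
```

A value close to -1 would support the summary; a value near 0 would suggest the pattern in the plot is weaker than it looks.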
Use the MLB_teams data in the mdsr package
to create an informative data graphic that illustrates the relationship
between winning percentage and payroll in context.
Teams with a higher winning percentage tend to have a higher total payroll. However, payroll varies widely among teams with similar winning percentages, so the relationship is not strong.
library(ggplot2)
library(mdsr)  # provides the MLB_teams data
# head(MLB_teams)
# Winning percentage is shown in percent and payroll in millions of dollars
ggplot(MLB_teams, aes(x = WPct * 100, y = payroll / 1e6)) +
  geom_point() +
  geom_smooth() +
  # Add a dollar sign to the y-axis labels
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(
    title = "The Relationship Between Team Winning Percentage and Payroll",
    x = 'Winning Percentage (%)',
    y = 'Team Payroll (Millions of Dollars)'
  )
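The plot rescales payroll to millions inside `aes()` and then adds a dollar sign through the scale. An alternative, sketched below, is to leave payroll in raw dollars and do the formatting in a single label function, since `scale_y_continuous()` accepts a function for its `labels` argument. The helper name `dollars_in_millions` is hypothetical.

```r
# Hypothetical label helper: turn raw dollar amounts into "$<n>M" strings.
# Missing values stay missing instead of becoming the text "$NAM".
dollars_in_millions <- function(x) {
  ifelse(is.na(x), NA_character_, sprintf("$%.0fM", x / 1e6))
}

labels <- dollars_in_millions(c(50e6, 100e6, 200e6))
print(labels)

# Possible use in the plot, with y = payroll left in raw dollars:
#   scale_y_continuous(labels = dollars_in_millions)
```

Keeping the unit conversion in the label function means the data mapped to the axis stays in its original units, which can make later layers or annotations easier to reason about.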
The RailTrail data set from the mosaicData
package describes the usage of a rail trail in Western Massachusetts.
Use these data to answer the following questions.
Create a scatterplot of volume against the high temperature that day.
Separate the plot into facets by weekday (an indicator
of weekend/holiday vs. weekday).
We find that on weekdays there is a stronger positive relationship between the number of trail users and the day’s high temperature: the warmer the weather, the more users cross the trail. The trend also holds on holidays and weekends, but it is less linear.
library(dplyr)
library(ggplot2)
library(mosaicData)  # provides the RailTrail data
# head(RailTrail)
# unique(RailTrail$dayType)
# Recode the logical weekday indicator into readable facet labels
RailTrail2 <- RailTrail %>%
  mutate(weekday = ifelse(weekday, "Weekday", "Holidays/Weekends"))
ggplot(RailTrail2, aes(x = hightemp, y = volume)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ weekday, nrow = 1) +
  labs(
    title = "Trail Volume vs. High Temperature by Day Type",
    x = "Daily High Temperature (°F)",
    y = "Trail Volume (Estimated Users per Day)")
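The comparison between weekdays and holidays/weekends can be made more concrete by fitting the same linear model within each group. The sketch below reports both the fitted slope (how steep the trend is) and the correlation (how tightly the points follow it). The helper name `fit_by_group` and the toy numbers are invented for illustration; they are not RailTrail values.

```r
# Hypothetical helper: for each level of a grouping column, fit
# volume ~ hightemp and return the slope and the Pearson correlation.
fit_by_group <- function(df, group_col) {
  groups <- split(df, df[[group_col]])
  t(sapply(groups, function(d) {
    c(slope = coef(lm(volume ~ hightemp, data = d))[["hightemp"]],
      r     = cor(d$hightemp, d$volume))
  }))
}

# Made-up toy data with the same column names as the plotted table.
toy_trail <- data.frame(
  weekday  = rep(c("Weekday", "Holidays/Weekends"), each = 4),
  hightemp = c(50, 60, 70, 80, 50, 60, 70, 80),
  volume   = c(200, 260, 330, 380, 300, 320, 390, 370)
)

fits <- fit_by_group(toy_trail, "weekday")
print(fits)  # one row per day type, columns slope and r

# With the recoded table from the plot: fit_by_group(RailTrail2, "weekday")
```

A steeper slope alone does not mean a stronger relationship, so reading the two columns together is a safer basis for a claim like the one in the summary.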
Using data from the nasaweather package, use the
geom_path function to plot the path of each tropical storm
in the storms data table. Use color to distinguish the
storms from one another, and use faceting to plot each year in its own
panel.
1997 had the fewest tropical storms. The storm paths generally span a wider range of latitudes than of longitudes.
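library(dplyr)
library(ggplot2)
# head(nasaweather::storms)
# Refer to the table by its full name: dplyr also exports a data set called
# storms, which would mask this one once dplyr is attached.
tropical_storms <- nasaweather::storms %>%
  filter(type == 'Tropical Storm')
# head(tropical_storms)
# Longitude goes on the x-axis and latitude on the y-axis so that each path
# is oriented like a map.
ggplot(tropical_storms, aes(x = long, y = lat, color = name)) +
  geom_path() +
  facet_wrap(~ year, ncol = 2) +
  labs(
    title = "The Path of Tropical Storms (1995 - 2000)",
    color = "Storm Name",
    x = "Longitude (degrees)",
    y = "Latitude (degrees)"
  )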
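The statement about which year had the fewest tropical storms can be verified by counting distinct storm names per year rather than reading it off the panels. This is a minimal base R sketch; the helper name `storms_per_year` and the toy rows (with placeholder storm names) are assumptions for illustration only.

```r
# Hypothetical helper: number of distinct storm names observed in each year.
storms_per_year <- function(df) {
  sapply(split(df$name, df$year), function(n) length(unique(n)))
}

# Made-up toy rows: several position records per storm, as in a track table.
toy_tracks <- data.frame(
  name = c("Alpha", "Alpha", "Beta", "Gamma", "Gamma",
           "Delta", "Delta", "Epsilon"),
  year = c(1996, 1996, 1996, 1997, 1997, 1998, 1998, 1998)
)

counts <- storms_per_year(toy_tracks)
print(counts)

# Year(s) with the fewest distinct storms in the toy data
fewest <- names(counts)[counts == min(counts)]
print(fewest)

# With the filtered table from the plot: storms_per_year(tropical_storms)
```

Counting unique names matters here because each storm contributes many rows, one per recorded position, so a plain row count would overstate the number of storms.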
Use the penguins data set from the
palmerpenguins package.
There’s a positive relationship between a penguin’s bill length and bill depth within each species: generally, the longer the bill, the deeper it is. Gentoo penguins have the shallowest bills of the three species, while Adelie penguins generally have the shortest bills.
Faceting by species makes the difference in bill depth even clearer: Gentoo bills are generally shallower than those of the other two species, while Adelie and Chinstrap share a similar range of bill depth. Arranging the facets in a single column makes bill length easier to compare across the three species: Adelie penguins generally have the shortest bills, while Gentoo and Chinstrap share a similar range of bill length.
library(ggplot2)
library(palmerpenguins)  # provides the penguins data
# head(penguins)
# unique(penguins$species)
# All species in one panel; color is inherited by the regression lines
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    color = 'Species', # legend label
    title = 'Bill Length vs. Bill Depth by Penguin Species',
    x = 'Bill Length (mm)',
    y = 'Bill Depth (mm)')
# Same plot with one panel per species, side by side (shared y-axis)
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ species, nrow = 1) +
  labs(
    color = 'Species', # legend label
    title = 'Bill Length vs. Bill Depth by Penguin Species',
    x = 'Bill Length (mm)',
    y = 'Bill Depth (mm)')
# Same plot with the panels stacked in one column (shared x-axis)
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ species, ncol = 1) +
  labs(
    color = 'Species', # legend label
    title = 'Bill Length vs. Bill Depth by Penguin Species',
    x = 'Bill Length (mm)',
    y = 'Bill Depth (mm)')
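The species comparisons above are read from the plots, and a small per-species summary table can back them up. The sketch below computes median bill length and depth by species with base R's `aggregate()`. The helper name `species_medians` and the toy measurements are made up for illustration and are not values from the penguins data.

```r
# Hypothetical helper: median bill length and depth for each species.
# The formula interface drops rows with missing values by default.
species_medians <- function(df) {
  aggregate(cbind(bill_length_mm, bill_depth_mm) ~ species,
            data = df, FUN = median)
}

# Made-up toy rows using the same column names as the penguins table.
toy_penguins <- data.frame(
  species        = rep(c("Adelie", "Chinstrap", "Gentoo"), each = 3),
  bill_length_mm = c(37, 39, 41, 47, 49, 51, 45, 47, 50),
  bill_depth_mm  = c(17.5, 18.5, 19, 17.8, 18.4, 19.2, 14, 15, 15.8)
)

medians <- species_medians(toy_penguins)
print(medians)

# With the real table loaded: species_medians(palmerpenguins::penguins)
```

Medians are used here rather than means so that a few unusually large or small birds do not dominate the comparison; that choice is a judgment call, and means or ranges would work with the same pattern.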