ggplot2 basics
During ANLY 512 we will be studying the theory and practice of
data visualization. We will be using R and the
packages within R to assemble data and construct many
different types of visualizations. We begin by studying some of the
theoretical aspects of visualization. To do that we must appreciate the
basic steps in the process of making a visualization.
The objective of this assignment is to complete and explain basic plots before moving on to more complicated ways to graph data.
A couple of tips: there may be pre-processing involved in your graphics, so you may have to do summaries or calculations to prepare the data. Those should be included in your work.
To ensure accuracy, pay close attention to axes and labels. You will be evaluated on the accuracy and expository nature of your graphics. Make sure your axis labels are easy to understand and are composed of full words, with units if necessary.
Each question is worth 5 points.
To submit this homework, create the document in RStudio using the knitr package (button included in RStudio) and then publish the document to your RPubs account. Once uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see the visualization and the code required to create it.
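The tips above on pre-processing and labelling can be sketched as follows. This is only an illustration: it assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in, which is not one of the assignment's data sets. The object names are hypothetical.

```r
library(ggplot2)

# mtcars ships with R and is only a stand-in here; it is not an assignment data set.
# Pre-processing step: mean fuel economy for each cylinder count.
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Plot the summary with full-word axis labels that include units.
summary_plot <- ggplot(mean_mpg_by_cyl, aes(x = factor(cyl), y = mpg)) +
  geom_col() +
  labs(title = "Mean Fuel Economy by Number of Cylinders",
       x = "Number of cylinders",
       y = "Mean fuel economy (miles per US gallon)")

print(summary_plot)
```

Keeping the summary step in the document, rather than plotting a pre-computed table, lets a reader see exactly how the plotted values were derived.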
Using data from the nasaweather package, create a scatter plot between wind and pressure, with color being used to distinguish the type of storm.
library(ggplot2)
library(reshape)
library(nasaweather) # provides the storms data set used below
cleanup = theme(panel.background = element_blank(),
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                axis.line.x = element_line(colour = "black"),
                axis.line.y = element_line(colour = "black"),
                legend.key = element_rect(colour = "white"),
                text = element_text(size = 12))
data(storms)
str(storms)
## tibble [2,747 × 11] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:2747] "Allison" "Allison" "Allison" "Allison" ...
## $ year : int [1:2747] 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ month : int [1:2747] 6 6 6 6 6 6 6 6 6 6 ...
## $ day : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
## $ hour : int [1:2747] 0 6 12 18 0 6 12 18 0 6 ...
## $ lat : num [1:2747] 17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
## $ long : num [1:2747] -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
## $ pressure: int [1:2747] 1005 1004 1003 1001 997 995 987 988 988 990 ...
## $ wind : int [1:2747] 30 30 35 40 50 60 65 65 65 60 ...
## $ type : chr [1:2747] "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
## $ seasday : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
## Dataframe for Scatter Plot:
# Select the three columns by name rather than by negative position.
Nasa_weather_data <- storms[, c("pressure", "wind", "type")]
str(Nasa_weather_data)
## tibble [2,747 × 3] (S3: tbl_df/tbl/data.frame)
## $ pressure: int [1:2747] 1005 1004 1003 1001 997 995 987 988 988 990 ...
## $ wind : int [1:2747] 30 30 35 40 50 60 65 65 65 60 ...
## $ type : chr [1:2747] "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
## Scatter Plot Data Object layer:
Nasa_Weather_Scatter_Plot <- ggplot(Nasa_weather_data,
                                    aes(x = wind, y = pressure, color = type))
# Addition of subsequent layers to the grouped scatter object layer:
# Axis units (knots, millibars) follow the documentation of the storms data.
Nasa_Weather_Plot <- Nasa_Weather_Scatter_Plot + geom_point() +
  labs(title = "Storm Wind Speed against Air Pressure by Storm Type",
       x = "Wind speed (knots)", y = "Air pressure (millibars)") +
  coord_cartesian(ylim = c(900, 1050), xlim = c(0, 170)) +
  scale_colour_discrete("Storm Type") + cleanup
## Saving Plot image:
ggsave("Nasa_Weather_Plot.png", plot = Nasa_Weather_Plot)
knitr::include_graphics("Nasa_Weather_Plot.png")
Use the MLB_teams data in the mdsr package to create an informative data graphic that illustrates the relationship between winning percentage and payroll in context.
library(reshape)
library(mdsr) # provides the MLB_teams data set used below
cleanup = theme(panel.background = element_blank(),
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                axis.line.x = element_line(colour = "black"),
                axis.line.y = element_line(colour = "black"),
                legend.key = element_rect(colour = "white"),
                text = element_text(size = 11))
data(MLB_teams)
str(MLB_teams)
## tibble [210 × 11] (S3: tbl_df/tbl/data.frame)
## $ yearID : int [1:210] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ teamID : chr [1:210] "ARI" "ATL" "BAL" "BOS" ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 5 5 2 2 2 5 5 2 5 2 ...
## $ W : int [1:210] 82 72 68 95 89 97 74 81 74 74 ...
## $ L : int [1:210] 80 90 93 67 74 64 88 81 88 88 ...
## $ WPct : num [1:210] 0.506 0.444 0.422 0.586 0.546 ...
## $ attendance: int [1:210] 2509924 2532834 1950075 3048250 2500648 3300200 2058632 2169760 2650218 3202645 ...
## $ normAttend: num [1:210] 0.584 0.589 0.454 0.709 0.582 ...
## $ payroll : int [1:210] 66202712 102365683 67196246 133390035 121189332 118345833 74117695 78970066 68655500 137685196 ...
## $ metroPop : num [1:210] 4489109 5614323 2785874 4732161 9554598 ...
## $ name : chr [1:210] "Arizona Diamondbacks" "Atlanta Braves" "Baltimore Orioles" "Boston Red Sox" ...
## Dataframe for Scatter Plot:
# Select the two columns by name rather than by negative position.
MLB_data_a <- MLB_teams[, c("WPct", "payroll")]
str(MLB_data_a)
## tibble [210 × 2] (S3: tbl_df/tbl/data.frame)
## $ WPct : num [1:210] 0.506 0.444 0.422 0.586 0.546 ...
## $ payroll: int [1:210] 66202712 102365683 67196246 133390035 121189332 118345833 74117695 78970066 68655500 137685196 ...
### Simple informative data graphic illustrating the relationship between winning percentage and payroll.
## Scatter Plot Data Object layer:
MLB_Scatter_Plot_a <- ggplot(MLB_data_a,
                             aes(x = WPct, y = payroll))
# Addition of subsequent layers to the scatter object layer
MLB_a <- MLB_Scatter_Plot_a + geom_point() +
  labs(title = "MLB Team Payroll against Winning Percentage",
       x = "Winning percentage (proportion of games won)", y = "Payroll (US dollars)") +
  coord_cartesian(ylim = c(18000000, 240000000), xlim = c(0.30, 0.70)) +
  geom_smooth(method = "lm", se = TRUE) + cleanup
ggsave("MLB_a.png", plot = MLB_a)
knitr::include_graphics("MLB_a.png")
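Raw payroll values in the hundreds of millions can be hard to read on an axis. One possible refinement, sketched here with only the first five WPct and payroll values shown in the str(MLB_teams) output above, is to rescale payroll to millions of dollars before plotting. The names mlb_sample, payroll_millions and payroll_plot are hypothetical and not part of the original answer.

```r
library(ggplot2)

# First five rows of WPct and payroll, copied from the str(MLB_teams) output above.
mlb_sample <- data.frame(
  WPct    = c(0.506, 0.444, 0.422, 0.586, 0.546),
  payroll = c(66202712, 102365683, 67196246, 133390035, 121189332)
)

# Express payroll in millions of dollars so the axis ticks are easy to read.
mlb_sample$payroll_millions <- mlb_sample$payroll / 1e6

payroll_plot <- ggplot(mlb_sample, aes(x = WPct, y = payroll_millions)) +
  geom_point() +
  labs(x = "Winning percentage (proportion of games won)",
       y = "Payroll (millions of US dollars)")

print(payroll_plot)
```

The same rescaling could be applied to the full MLB_teams data before building the plots in this section.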
### Additional Informative data graphic, illustrating the relationship between winning percentage and payroll and distinguished by different MLB Teams.
# Select the three columns by name rather than by negative position.
MLB_data_b <- MLB_teams[, c("WPct", "payroll", "name")]
str(MLB_data_b)
## tibble [210 × 3] (S3: tbl_df/tbl/data.frame)
## $ WPct : num [1:210] 0.506 0.444 0.422 0.586 0.546 ...
## $ payroll: int [1:210] 66202712 102365683 67196246 133390035 121189332 118345833 74117695 78970066 68655500 137685196 ...
## $ name : chr [1:210] "Arizona Diamondbacks" "Atlanta Braves" "Baltimore Orioles" "Boston Red Sox" ...
## Scatter Plot Data Object layer:
MLB_Scatter_Plot_b <- ggplot(MLB_data_b,
                             aes(x = WPct, y = payroll, color = name))
# Addition of subsequent layers to the grouped scatter object layer
MLB_b <- MLB_Scatter_Plot_b + geom_point() +
  labs(title = "MLB Team Payroll against Winning Percentage by Team",
       x = "Winning percentage (proportion of games won)", y = "Payroll (US dollars)") +
  coord_cartesian(ylim = c(18000000, 240000000), xlim = c(0.30, 0.70)) +
  scale_colour_discrete("MLB Team Name") + cleanup
## Saving Plot image:
ggsave("MLB_b.png", plot = MLB_b)
knitr::include_graphics("MLB_b.png")
The RailTrail data set from the mosaicData package describes the usage of a rail trail in Western Massachusetts. Use these data to answer the following questions.
Create a scatterplot of the number of crossings per day volume against the high temperature that day.
Separate your plot into facets by weekday (an indicator of weekend/holiday vs. weekday).
Add regression lines to the two facets.
library(mosaicData) # provides the RailTrail data set used below
library(dplyr) # provides %>% and mutate()
cleanup = theme(panel.background = element_blank(),
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                axis.line.x = element_line(colour = "black"),
                axis.line.y = element_line(colour = "black"),
                legend.key = element_rect(colour = "white"),
                text = element_text(size = 9))
## Exploration of the data set.
data(RailTrail)
str(RailTrail)
## 'data.frame': 90 obs. of 11 variables:
## $ hightemp : int 83 73 74 95 44 69 66 66 80 79 ...
## $ lowtemp : int 50 49 52 61 52 54 39 38 55 45 ...
## $ avgtemp : num 66.5 61 63 78 48 61.5 52.5 52 67.5 62 ...
## $ spring : int 0 0 1 0 1 1 1 1 0 0 ...
## $ summer : int 1 1 0 1 0 0 0 0 1 1 ...
## $ fall : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cloudcover: num 7.6 6.3 7.5 2.6 10 ...
## $ precip : num 0 0.29 0.32 0 0.14 ...
## $ volume : int 501 419 397 385 200 375 417 629 533 547 ...
## $ weekday : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
## $ dayType : chr "weekday" "weekday" "weekday" "weekend" ...
### Answer:
## Scatter plot of the number of crossings per day (`volume`) against the high temperature that day, faceted by `weekday` (weekday vs. weekend/holiday), with a regression line in each facet.
### Plot Layout-1
RT_a <- RailTrail %>%
  mutate(Week_Day = ifelse(weekday, "Weekday", "Weekend-Holiday")) %>%
  ## Plot determination with facet wrap
  ggplot(aes(x = hightemp, y = volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Facet wrapping the plot by Week_Day, in two rows:
  facet_wrap(~ Week_Day, nrow = 2) +
  labs(title = "Rail Trail Crossings per Day against Daily High Temperature",
       x = "High temperature (°F)", y = "Trail crossings per day (count)") +
  coord_cartesian(ylim = c(0, 800), xlim = c(40, 100)) + cleanup
## Saving Plot image:
ggsave("RT_a.png", plot = RT_a)
knitr::include_graphics("RT_a.png")
### Alternate Layout-2 of the same plot:
RT_b <- RailTrail %>%
  mutate(Week_Day = ifelse(weekday, "Weekday", "Weekend-Holiday")) %>%
  ## Plot determination with facet wrap
  ggplot(aes(x = hightemp, y = volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Facet wrapping the plot by Week_Day, in a single row:
  facet_wrap(~ Week_Day, nrow = 1) +
  labs(title = "Rail Trail Crossings per Day against Daily High Temperature",
       x = "High temperature (°F)", y = "Trail crossings per day (count)") +
  coord_cartesian(ylim = c(0, 800), xlim = c(40, 100)) + cleanup
## Saving Plot image:
ggsave("RT_b.png", plot = RT_b)
knitr::include_graphics("RT_b.png")
Using data from the nasaweather package, use the geom_path function to plot the path of each tropical storm in the storms data table. Use color to distinguish the storms from one another, and use faceting to plot each year in its own panel.
library(nasaweather) # provides the storms data set used below
cleanup = theme(panel.background = element_blank(),
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                axis.line.x = element_line(colour = "black"),
                axis.line.y = element_line(colour = "black"),
                legend.key = element_rect(colour = "white"),
                text = element_text(size = 10))
## Extracting "storms" database object from nasaweather package.
data(storms)
str(storms)
## tibble [2,747 × 11] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:2747] "Allison" "Allison" "Allison" "Allison" ...
## $ year : int [1:2747] 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ month : int [1:2747] 6 6 6 6 6 6 6 6 6 6 ...
## $ day : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
## $ hour : int [1:2747] 0 6 12 18 0 6 12 18 0 6 ...
## $ lat : num [1:2747] 17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
## $ long : num [1:2747] -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
## $ pressure: int [1:2747] 1005 1004 1003 1001 997 995 987 988 988 990 ...
## $ wind : int [1:2747] 30 30 35 40 50 60 65 65 65 60 ...
## $ type : chr [1:2747] "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
## $ seasday : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
## Determination for the Tropical Storms Path plot:
# Longitude goes on the x-axis and latitude on the y-axis so the paths appear in map orientation.
Tropical_Storm_Paths <- ggplot(storms, aes(x = long, y = lat)) +
  geom_path(aes(col = name)) +
  labs(title = "Paths of Tropical Storms by Year",
       x = "Longitude (degrees)", y = "Latitude (degrees)") +
  facet_wrap(~year, nrow = 2) +
  scale_colour_discrete("Storm Name") + cleanup
ggsave("Tropical_Storm_Paths.png", plot = Tropical_Storm_Paths)
knitr::include_graphics("Tropical_Storm_Paths.png")
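geom_path() connects observations in the order in which they appear in the data, so a storm track is drawn correctly only when its rows are already in time order. The following minimal sketch uses just the first ten lat and long values shown in the str(storms) output above, with longitude on the x-axis and latitude on the y-axis (the usual map orientation). The names track_sample and track_plot are hypothetical.

```r
library(ggplot2)

# First ten rows of lat and long, copied from the str(storms) output above.
track_sample <- data.frame(
  lat  = c(17.4, 18.3, 19.3, 20.6, 22, 23.3, 24.7, 26.2, 27.6, 28.5),
  long = c(-84.3, -84.9, -85.7, -85.8, -86, -86.3, -86.2, -86.2, -86.1, -85.6)
)

# geom_path() joins the points in row order; geom_point() marks each observation.
track_plot <- ggplot(track_sample, aes(x = long, y = lat)) +
  geom_path() +
  geom_point() +
  labs(x = "Longitude (degrees)", y = "Latitude (degrees)")

print(track_plot)
```

By contrast, geom_line() would reorder the points by their x value, which is not what a storm track needs.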
Use the penguins data set from the palmerpenguins package.
library(palmerpenguins) # provides the penguins data set used below
## Extracting the dataset:
data(penguins)
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## Part a:
## Scatter plot of bill_length_mm against bill_depth_mm, colored by penguin species, with a regression line added for each species:
# Ggplot Object layer:
Penguins_Scatter_Plot_a <- ggplot(penguins,
                                  aes(x = bill_depth_mm, y = bill_length_mm, color = species))
# Addition of subsequent layers to the grouped scatter object layer
# The axis labels match the aesthetics: bill depth on x, bill length on y.
Penguin_a <- Penguins_Scatter_Plot_a + geom_point() +
  labs(title = "Bill Length against Bill Depth by Penguin Species",
       x = "Bill depth (mm)", y = "Bill length (mm)") +
  coord_cartesian(ylim = c(30, 60), xlim = c(10, 25)) +
  scale_colour_discrete("Penguin Species") + geom_smooth(method = "lm", se = FALSE) + cleanup
ggsave("Penguin_a.png", plot = Penguin_a)
knitr::include_graphics("Penguin_a.png")
### Part b:
## Scatter plot of bill_length_mm against bill_depth_mm, separated into facets by species:
Penguins_Scatter_Plot_b <- ggplot(penguins,
                                  aes(x = bill_depth_mm, y = bill_length_mm, color = species))
# Addition of subsequent layers to the grouped scatter object layer
# The axis labels match the aesthetics: bill depth on x, bill length on y.
Penguin_b <- Penguins_Scatter_Plot_b + geom_point() +
  labs(title = "Bill Length against Bill Depth by Penguin Species",
       x = "Bill depth (mm)", y = "Bill length (mm)") +
  coord_cartesian(ylim = c(30, 60), xlim = c(11, 23)) +
  facet_wrap(~species, nrow = 3) +
  scale_colour_discrete("Penguin Species") + geom_smooth(method = "lm", se = FALSE) + cleanup
ggsave("Penguin_b.png", plot = Penguin_b)
knitr::include_graphics("Penguin_b.png")
## Inference on the association between bill_length_mm and bill_depth_mm by penguin species:
# The scatter plots show a positive relationship between bill length and bill depth within each of the three penguin species. The bill depth ranges of the "Adelie" and "Chinstrap" species are almost identical, but their bill lengths differ, with "Chinstrap" bills being longer. The third species, "Gentoo", has the smallest bill depths of the three, while its bill lengths largely overlap with those of the "Chinstrap" species.
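The visual impression of a positive within-species trend can be checked numerically. The sketch below is one way to do that and is not part of the original answer: it assumes the palmerpenguins package is installed, and the names complete_bills, cor_by_species and cor_overall are hypothetical.

```r
library(palmerpenguins)

# Keep only rows where both bill measurements are present.
complete_bills <- penguins[!is.na(penguins$bill_length_mm) &
                             !is.na(penguins$bill_depth_mm), ]

# Correlation between bill length and bill depth within each species.
cor_by_species <- sapply(
  split(complete_bills, complete_bills$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)

# Correlation across all penguins, ignoring species.
cor_overall <- cor(complete_bills$bill_length_mm, complete_bills$bill_depth_mm)

print(round(cor_by_species, 2))
print(round(cor_overall, 2))
```

Within each species the correlation is positive, whereas the pooled correlation is negative, which is why colouring or faceting by species matters for this plot.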