ggplot2 basics
During ANLY 512 we will be studying the theory and practice of
data visualization. We will be using R and the
packages within R to assemble data and construct many
different types of visualizations. We begin by studying some of the
theoretical aspects of visualization. To do that we must appreciate the
basic steps in the process of making a visualization.
The objective of this assignment is to complete and explain basic plots before moving on to more complicated ways to graph data.
A couple of tips: remember that your graphics may require pre-processing, so you may have to compute summaries or other calculations to prepare the data. Include those steps in your work.
To ensure accuracy, pay close attention to axes and labels; you will be evaluated on the accuracy and expository quality of your graphics. Make sure your axis labels are easy to understand and consist of full words, with units where necessary.
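As a minimal sketch of the two tips above, the summary step can sit in the script right before the plot that uses it. This assumes the storms table from nasaweather (used later in this assignment) and assumes wind is recorded in knots; the names wind_by_type and p_tip are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Pre-processing: mean wind speed per storm type, kept in the script so the
# reader can see how the plotted values were derived.
wind_by_type <- nasaweather::storms %>%
  group_by(type) %>%
  summarise(mean_wind = mean(wind), .groups = "drop")

# Plot the summary with full-word axis labels and units
# (knots is an assumption about the wind column).
p_tip <- ggplot(wind_by_type, aes(x = type, y = mean_wind)) +
  geom_col() +
  labs(title = "Mean Wind Speed by Storm Type",
       x = "Storm type",
       y = "Mean wind speed (knots)")
p_tip
```

Keeping the summary next to the plot makes it easy to check that the axis labels describe what was actually computed.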
Each question is worth 5 points.
To submit this homework, create the document in RStudio using the knitr package (button included in RStudio) and publish it to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see both the visualization and the code required to create it.
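Using data from the nasaweather package, create a
scatter plot between wind and pressure, with color being used to
distinguish the type of storm.
# ggplot2 and dplyr are attached before nasaweather so that nasaweather's
# storms table is the one found (dplyr ships a different data set also named storms).
library(ggplot2)
library(dplyr)
library(nasaweather)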
head(storms)
## # A tibble: 6 × 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Allison 1995 6 3 0 17.4 -84.3 1005 30 Tropical D… 3
## 2 Allison 1995 6 3 6 18.3 -84.9 1004 30 Tropical D… 3
## 3 Allison 1995 6 3 12 19.3 -85.7 1003 35 Tropical S… 3
## 4 Allison 1995 6 3 18 20.6 -85.8 1001 40 Tropical S… 3
## 5 Allison 1995 6 4 0 22 -86 997 50 Tropical S… 4
## 6 Allison 1995 6 4 6 23.3 -86.3 995 60 Tropical S… 4
data <- storms
summary(data)
## name year month day
## Length:2747 Min. :1995 Min. : 6.000 Min. : 1.00
## Class :character 1st Qu.:1995 1st Qu.: 8.000 1st Qu.: 9.00
## Mode :character Median :1997 Median : 9.000 Median :18.00
## Mean :1997 Mean : 8.803 Mean :16.98
## 3rd Qu.:1999 3rd Qu.:10.000 3rd Qu.:25.00
## Max. :2000 Max. :12.000 Max. :31.00
## hour lat long pressure
## Min. : 0.000 Min. : 8.30 Min. :-107.30 Min. : 905.0
## 1st Qu.: 3.500 1st Qu.:17.25 1st Qu.: -77.60 1st Qu.: 980.0
## Median :12.000 Median :25.00 Median : -60.90 Median : 995.0
## Mean : 9.057 Mean :26.67 Mean : -60.87 Mean : 989.8
## 3rd Qu.:18.000 3rd Qu.:33.90 3rd Qu.: -45.80 3rd Qu.:1004.0
## Max. :18.000 Max. :70.70 Max. : 1.00 Max. :1019.0
## wind type seasday
## Min. : 15.00 Length:2747 Min. : 3.0
## 1st Qu.: 35.00 Class :character 1st Qu.: 84.0
## Median : 50.00 Mode :character Median :103.0
## Mean : 54.68 Mean :102.6
## 3rd Qu.: 70.00 3rd Qu.:125.0
## Max. :155.00 Max. :185.0
str(data)
## tibble [2,747 × 11] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:2747] "Allison" "Allison" "Allison" "Allison" ...
## $ year : int [1:2747] 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ month : int [1:2747] 6 6 6 6 6 6 6 6 6 6 ...
## $ day : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
## $ hour : int [1:2747] 0 6 12 18 0 6 12 18 0 6 ...
## $ lat : num [1:2747] 17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
## $ long : num [1:2747] -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
## $ pressure: int [1:2747] 1005 1004 1003 1001 997 995 987 988 988 990 ...
## $ wind : int [1:2747] 30 30 35 40 50 60 65 65 65 60 ...
## $ type : chr [1:2747] "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
## $ seasday : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
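# Units are assumed: wind in knots, pressure in millibars
ggplot(data, aes(x = wind, y = pressure, color = type)) +
  theme_bw() +
  geom_point(alpha = 0.25) +
  labs(title = "Wind Speed vs. Pressure by Storm Type",
       x = "Wind Speed (knots)",
       y = "Pressure (millibars)",
       color = "Storm Type")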
Use the MLB_teams data in the mdsr package
to create an informative data graphic that illustrates the relationship
between winning percentage and payroll in context.
library(mdsr)
data2 <- MLB_teams
summary(data2)
## yearID teamID lgID W L
## Min. :2008 Length:210 AA: 0 Min. : 51.00 Min. : 59.00
## 1st Qu.:2009 Class :character AL:100 1st Qu.: 73.00 1st Qu.: 72.00
## Median :2011 Mode :character FL: 0 Median : 81.00 Median : 81.00
## Mean :2011 NA: 0 Mean : 80.99 Mean : 80.99
## 3rd Qu.:2013 NL:110 3rd Qu.: 90.00 3rd Qu.: 89.00
## Max. :2014 PL: 0 Max. :103.00 Max. :111.00
## UA: 0
## WPct attendance normAttend payroll
## Min. :0.3148 Min. :1335076 Min. :0.3106 Min. : 17890700
## 1st Qu.:0.4506 1st Qu.:1940441 1st Qu.:0.4514 1st Qu.: 67325266
## Median :0.5000 Median :2418204 Median :0.5625 Median : 85803966
## Mean :0.5000 Mean :2481715 Mean :0.5773 Mean : 94365324
## 3rd Qu.:0.5556 3rd Qu.:3041615 3rd Qu.:0.7076 3rd Qu.:114741109
## Max. :0.6358 Max. :4298655 Max. :1.0000 Max. :231978886
##
## metroPop name
## Min. : 1572245 Length:210
## 1st Qu.: 2785874 Class :character
## Median : 4541584 Mode :character
## Mean : 6014841
## 3rd Qu.: 6490180
## Max. :20092883
##
str(data2)
## tibble [210 × 11] (S3: tbl_df/tbl/data.frame)
## $ yearID : int [1:210] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ teamID : chr [1:210] "ARI" "ATL" "BAL" "BOS" ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 5 5 2 2 2 5 5 2 5 2 ...
## $ W : int [1:210] 82 72 68 95 89 97 74 81 74 74 ...
## $ L : int [1:210] 80 90 93 67 74 64 88 81 88 88 ...
## $ WPct : num [1:210] 0.506 0.444 0.422 0.586 0.546 ...
## $ attendance: int [1:210] 2509924 2532834 1950075 3048250 2500648 3300200 2058632 2169760 2650218 3202645 ...
## $ normAttend: num [1:210] 0.584 0.589 0.454 0.709 0.582 ...
## $ payroll : int [1:210] 66202712 102365683 67196246 133390035 121189332 118345833 74117695 78970066 68655500 137685196 ...
## $ metroPop : num [1:210] 4489109 5614323 2785874 4732161 9554598 ...
## $ name : chr [1:210] "Arizona Diamondbacks" "Atlanta Braves" "Baltimore Orioles" "Boston Red Sox" ...
ggplot(MLB_teams, aes(x = payroll/1000000, y = WPct*100)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'Relationship Between Payroll & Winning Percentage',
       x = 'Payroll ($ in Millions)',
       y = 'Winning Percentage (%)') +
  theme(plot.title = element_text(hjust = 0.5))
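To put a number on the trend line in the plot above, one option is to compute the correlation and the slope of the fitted line. This is only a sketch: the object names payroll_wpct_cor and payroll_fit are hypothetical, and the linear model is the same simple fit that geom_smooth(method = 'lm') draws, not a claim about the best model for these data.

```r
# Pearson correlation between payroll and winning percentage.
payroll_wpct_cor <- cor(mdsr::MLB_teams$payroll, mdsr::MLB_teams$WPct)

# Linear fit with payroll rescaled to millions of dollars, so the slope reads
# as "change in winning percentage (points) per additional $1M of payroll".
payroll_fit <- lm(I(WPct * 100) ~ I(payroll / 1e6), data = mdsr::MLB_teams)

payroll_wpct_cor
coef(payroll_fit)
```

Reporting the slope in the same units as the axis labels (percentage points per million dollars) keeps the text and the graphic consistent.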
The RailTrail data set from the mosaicData
package describes the usage of a rail trail in Western Massachusetts.
Use these data to answer the following questions.
a. Create a scatterplot of volume against the high temperature that day.
b. Separate the plot into facets by weekday (an indicator of weekend/holiday vs. weekday).
c. Add regression lines to the two facets.
library(mosaicData)
data3 <- RailTrail
summary(data3)
## hightemp lowtemp avgtemp spring
## Min. :41.00 Min. :19.00 Min. :33.00 Min. :0.0000
## 1st Qu.:59.25 1st Qu.:38.00 1st Qu.:48.62 1st Qu.:0.0000
## Median :69.50 Median :44.50 Median :55.25 Median :1.0000
## Mean :68.83 Mean :46.03 Mean :57.43 Mean :0.5889
## 3rd Qu.:77.75 3rd Qu.:53.75 3rd Qu.:64.50 3rd Qu.:1.0000
## Max. :97.00 Max. :72.00 Max. :84.00 Max. :1.0000
## summer fall cloudcover precip
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3.650 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median : 6.400 Median :0.00000
## Mean :0.2778 Mean :0.1333 Mean : 5.807 Mean :0.09256
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 8.475 3rd Qu.:0.02000
## Max. :1.0000 Max. :1.0000 Max. :10.000 Max. :1.49000
## volume weekday dayType
## Min. :129.0 Mode :logical Length:90
## 1st Qu.:291.5 FALSE:28 Class :character
## Median :373.0 TRUE :62 Mode :character
## Mean :375.4
## 3rd Qu.:451.2
## Max. :736.0
str(data3)
## 'data.frame': 90 obs. of 11 variables:
## $ hightemp : int 83 73 74 95 44 69 66 66 80 79 ...
## $ lowtemp : int 50 49 52 61 52 54 39 38 55 45 ...
## $ avgtemp : num 66.5 61 63 78 48 61.5 52.5 52 67.5 62 ...
## $ spring : int 0 0 1 0 1 1 1 1 0 0 ...
## $ summer : int 1 1 0 1 0 0 0 0 1 1 ...
## $ fall : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cloudcover: num 7.6 6.3 7.5 2.6 10 ...
## $ precip : num 0 0.29 0.32 0 0.14 ...
## $ volume : int 501 419 397 385 200 375 417 629 533 547 ...
## $ weekday : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
## $ dayType : chr "weekday" "weekday" "weekday" "weekend" ...
# Part a: volume (the response) on the y-axis, high temperature on the x-axis
RailTrail %>%
  ggplot(aes(x = hightemp, y = volume)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'Number of Crossings vs. Daily High Temperature',
       x = 'High Temperature (Degrees F)',
       y = 'Number of Crossings per Day') +
  theme(plot.title = element_text(hjust = 0.5))
# Parts b & c: facet by day type and add a regression line to each facet
RailTrail_data <- RailTrail %>%
  mutate(weekday_l = factor(weekday,
                            levels = c(TRUE, FALSE),
                            labels = c('Weekday', 'Weekend/Holiday')))
RailTrail_data %>%
  ggplot(aes(x = hightemp, y = volume)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  facet_wrap(~weekday_l, ncol = 2) +
  labs(title = 'Number of Crossings vs. Daily High Temperature',
       x = 'High Temperature (Degrees F)',
       y = 'Number of Crossings per Day') +
  theme(plot.title = element_text(hjust = 0.5))
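The regression lines in the two facets can also be compared numerically. The sketch below fits volume on hightemp separately for each day type, following the question's phrasing of volume against high temperature; the name fits_by_daytype is hypothetical.

```r
# Fit volume ~ hightemp separately for each value of the logical weekday
# column and keep the intercept and slope of each fit.
fits_by_daytype <- lapply(
  split(mosaicData::RailTrail, mosaicData::RailTrail$weekday),
  function(d) coef(lm(volume ~ hightemp, data = d))
)

# split() orders the groups as FALSE, then TRUE.
names(fits_by_daytype) <- c("Weekend/Holiday", "Weekday")
fits_by_daytype
```

Each element holds the intercept and the slope (additional crossings per degree Fahrenheit) for that day type, which makes it possible to say how different the two lines are rather than judging by eye.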
Using data from the nasaweather package, use the
geom_path function to plot the path of each tropical storm
in the storms data table. Use color to distinguish the
storms from one another, and use faceting to plot each year in its own
panel.
library(nasaweather)
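# Keep only the observations classified as tropical storms
s_data <- storms %>% filter(type == 'Tropical Storm')
# Longitude belongs on the x-axis and latitude on the y-axis so the tracks are drawn as on a map
s_data %>%
  ggplot(aes(x = long, y = lat)) +
  geom_path(aes(color = name)) +
  facet_wrap(~year, nrow = 3) +
  labs(title = 'Path of Tropical Storms',
       x = 'Longitude',
       y = 'Latitude') +
  theme(plot.title = element_text(hjust = 0.5))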
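The question can also be read as asking for the full track of every storm in the table, rather than only the rows whose type is 'Tropical Storm'. A sketch under that alternative reading follows; the name p_tracks is hypothetical, and hiding the legend is a design choice made here because of the large number of storm names, not something the assignment requires.

```r
library(ggplot2)

# Use every observation of every storm. Grouping by name and year keeps each
# storm's path separate even if a name is reused in another season.
p_tracks <- ggplot(nasaweather::storms,
                   aes(x = long, y = lat,
                       group = interaction(name, year),
                       color = name)) +
  geom_path() +
  facet_wrap(~year, nrow = 3) +
  labs(title = "Storm Tracks by Year",
       x = "Longitude",
       y = "Latitude") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none")
p_tracks
```

Because geom_path connects points in the order they appear in the data, this relies on the rows of each storm being stored in time order, as they appear to be in the head(storms) output shown earlier.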
Use the penguins data set from the
palmerpenguins package.
library(palmerpenguins)
data5 <- penguins
summary(data5)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
str(data5)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
#Part a
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species, fill = species)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'Bill Length VS Bill Depth (By Penguin Species)',
       x = 'Bill Length (mm)',
       y = 'Bill Depth (mm)') +
  theme(plot.title = element_text(hjust = 0.5))
# Observation: Within each species, bill depth increases with bill length: the three fitted regression lines all slope upward, and the points cluster fairly closely around their species' line. The association between bill depth and bill length is therefore positive for Adelie, Chinstrap, and Gentoo penguins alike.
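The observation above can be checked numerically. The sketch below computes the correlation between bill length and bill depth within each species and again with all species pooled; the names bills, cor_by_species, and cor_pooled are hypothetical. In this data set the within-species correlations are positive while the pooled correlation is negative, which is why coloring by species matters for reading the plot.

```r
library(palmerpenguins)

# Drop rows where either bill measurement is missing.
keep <- complete.cases(penguins[, c("bill_length_mm", "bill_depth_mm")])
bills <- penguins[keep, ]

# Correlation within each species.
cor_by_species <- sapply(split(bills, bills$species), function(d) {
  cor(d$bill_length_mm, d$bill_depth_mm)
})

# Correlation with all species pooled together.
cor_pooled <- cor(bills$bill_length_mm, bills$bill_depth_mm)

round(cor_by_species, 2)
round(cor_pooled, 2)
```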
#Part b
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species, fill = species)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  facet_wrap(~species) +
  labs(title = 'Bill Length VS Bill Depth (By Penguin Species)',
       x = 'Bill Length (mm)',
       y = 'Bill Depth (mm)') +
  theme(plot.title = element_text(hjust = 0.5))
# Observation: Separating the species into facets shows that the relationship between bill length and bill depth is fairly consistent across species (the slopes are similar), while the ranges of the two variables differ. Chinstrap and Gentoo bill lengths fall roughly between 40 and 60 mm, whereas Adelie bill lengths fall roughly between 30 and 46 mm. The greatest bill depths are observed in Adelie penguins and the smallest in Gentoo penguins.
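The ranges quoted in the observation can be read off a per-species summary instead of being estimated from the plot. A minimal sketch, with the hypothetical name bill_ranges:

```r
library(palmerpenguins)
library(dplyr)

# Minimum and maximum bill length and depth for each species,
# ignoring missing measurements.
bill_ranges <- penguins %>%
  group_by(species) %>%
  summarise(
    min_length_mm = min(bill_length_mm, na.rm = TRUE),
    max_length_mm = max(bill_length_mm, na.rm = TRUE),
    min_depth_mm  = min(bill_depth_mm, na.rm = TRUE),
    max_depth_mm  = max(bill_depth_mm, na.rm = TRUE),
    .groups = "drop"
  )
bill_ranges
```

Including a table like this alongside the faceted plot follows the earlier tip that any summaries used to support a graphic should be part of the submitted work.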