Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. We begin by studying some of the theoretical aspects of visualization. To do that we must appreciate the basic steps in the process of making a visualization.

The objective of this assignment is to complete and explain basic plots before moving on to more complicated ways to graph data.

Each question is worth 5 points.

To submit this homework, create the document in RStudio using the knitr package (a Knit button is included in RStudio), then publish the document to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see both the visualization and the code required to create it.

Questions

  1. Using data from the nasaweather package, create a scatter plot between wind and pressure, with color being used to distinguish the type of storm.
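library(ggplot2)      # ggplot(), geom_*(), labs(), theme()
library(dplyr)        # %>%, filter(), mutate(); attached first so nasaweather's storms is the one in scope
library(nasaweather)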

head(storms)
## # A tibble: 6 × 11
##   name     year month   day  hour   lat  long pressure  wind type        seasday
##   <chr>   <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>         <int>
## 1 Allison  1995     6     3     0  17.4 -84.3     1005    30 Tropical D…       3
## 2 Allison  1995     6     3     6  18.3 -84.9     1004    30 Tropical D…       3
## 3 Allison  1995     6     3    12  19.3 -85.7     1003    35 Tropical S…       3
## 4 Allison  1995     6     3    18  20.6 -85.8     1001    40 Tropical S…       3
## 5 Allison  1995     6     4     0  22   -86        997    50 Tropical S…       4
## 6 Allison  1995     6     4     6  23.3 -86.3      995    60 Tropical S…       4
data <- storms
summary(data)
##      name                year          month             day       
##  Length:2747        Min.   :1995   Min.   : 6.000   Min.   : 1.00  
##  Class :character   1st Qu.:1995   1st Qu.: 8.000   1st Qu.: 9.00  
##  Mode  :character   Median :1997   Median : 9.000   Median :18.00  
##                     Mean   :1997   Mean   : 8.803   Mean   :16.98  
##                     3rd Qu.:1999   3rd Qu.:10.000   3rd Qu.:25.00  
##                     Max.   :2000   Max.   :12.000   Max.   :31.00  
##       hour             lat             long            pressure     
##  Min.   : 0.000   Min.   : 8.30   Min.   :-107.30   Min.   : 905.0  
##  1st Qu.: 3.500   1st Qu.:17.25   1st Qu.: -77.60   1st Qu.: 980.0  
##  Median :12.000   Median :25.00   Median : -60.90   Median : 995.0  
##  Mean   : 9.057   Mean   :26.67   Mean   : -60.87   Mean   : 989.8  
##  3rd Qu.:18.000   3rd Qu.:33.90   3rd Qu.: -45.80   3rd Qu.:1004.0  
##  Max.   :18.000   Max.   :70.70   Max.   :   1.00   Max.   :1019.0  
##       wind            type              seasday     
##  Min.   : 15.00   Length:2747        Min.   :  3.0  
##  1st Qu.: 35.00   Class :character   1st Qu.: 84.0  
##  Median : 50.00   Mode  :character   Median :103.0  
##  Mean   : 54.68                      Mean   :102.6  
##  3rd Qu.: 70.00                      3rd Qu.:125.0  
##  Max.   :155.00                      Max.   :185.0
str(data)
## tibble [2,747 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name    : chr [1:2747] "Allison" "Allison" "Allison" "Allison" ...
##  $ year    : int [1:2747] 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ month   : int [1:2747] 6 6 6 6 6 6 6 6 6 6 ...
##  $ day     : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
##  $ hour    : int [1:2747] 0 6 12 18 0 6 12 18 0 6 ...
##  $ lat     : num [1:2747] 17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
##  $ long    : num [1:2747] -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
##  $ pressure: int [1:2747] 1005 1004 1003 1001 997 995 987 988 988 990 ...
##  $ wind    : int [1:2747] 30 30 35 40 50 60 65 65 65 60 ...
##  $ type    : chr [1:2747] "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
##  $ seasday : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
ggplot(data, aes(x = wind, y = pressure, color = type)) +
  theme_bw() +
  geom_point(alpha = 0.25) +
  labs(title = "Scatterplot between Wind and Pressure",
       x = "Wind",
       y = "Pressure")
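The `alpha = 0.25` setting in the plot above suggests that many observations sit on exactly the same coordinates. The following is a minimal base-R sketch for checking how much overplotting there is. The data frame `toy` is a hypothetical stand-in for `storms` with invented values, and `count_overlaps` is a helper name introduced here for illustration only.

```r
# Hypothetical stand-in for the storms table (values invented for illustration)
toy <- data.frame(
  wind     = c(30, 30, 35, 35, 35, 50),
  pressure = c(1005, 1005, 1003, 1003, 1003, 997)
)

# Count how many rows fall on each distinct (x, y) position,
# most crowded positions first
count_overlaps <- function(df, x, y) {
  counts <- aggregate(n ~ ., data = cbind(df[c(x, y)], n = 1), FUN = sum)
  counts[order(-counts$n), ]
}

overlaps <- count_overlaps(toy, "wind", "pressure")
print(overlaps)
```

If most positions hold several observations, transparency (as used above) or jittering is a reasonable way to keep dense regions readable.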

  2. Use the MLB_teams data in the mdsr package to create an informative data graphic that illustrates the relationship between winning percentage and payroll in context.
library(mdsr)

data2 <- MLB_teams
summary(data2)
##      yearID        teamID          lgID           W                L         
##  Min.   :2008   Length:210         AA:  0   Min.   : 51.00   Min.   : 59.00  
##  1st Qu.:2009   Class :character   AL:100   1st Qu.: 73.00   1st Qu.: 72.00  
##  Median :2011   Mode  :character   FL:  0   Median : 81.00   Median : 81.00  
##  Mean   :2011                      NA:  0   Mean   : 80.99   Mean   : 80.99  
##  3rd Qu.:2013                      NL:110   3rd Qu.: 90.00   3rd Qu.: 89.00  
##  Max.   :2014                      PL:  0   Max.   :103.00   Max.   :111.00  
##                                    UA:  0                                    
##       WPct          attendance        normAttend        payroll         
##  Min.   :0.3148   Min.   :1335076   Min.   :0.3106   Min.   : 17890700  
##  1st Qu.:0.4506   1st Qu.:1940441   1st Qu.:0.4514   1st Qu.: 67325266  
##  Median :0.5000   Median :2418204   Median :0.5625   Median : 85803966  
##  Mean   :0.5000   Mean   :2481715   Mean   :0.5773   Mean   : 94365324  
##  3rd Qu.:0.5556   3rd Qu.:3041615   3rd Qu.:0.7076   3rd Qu.:114741109  
##  Max.   :0.6358   Max.   :4298655   Max.   :1.0000   Max.   :231978886  
##                                                                         
##     metroPop            name          
##  Min.   : 1572245   Length:210        
##  1st Qu.: 2785874   Class :character  
##  Median : 4541584   Mode  :character  
##  Mean   : 6014841                     
##  3rd Qu.: 6490180                     
##  Max.   :20092883                     
## 
str(data2)
## tibble [210 × 11] (S3: tbl_df/tbl/data.frame)
##  $ yearID    : int [1:210] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ teamID    : chr [1:210] "ARI" "ATL" "BAL" "BOS" ...
##  $ lgID      : Factor w/ 7 levels "AA","AL","FL",..: 5 5 2 2 2 5 5 2 5 2 ...
##  $ W         : int [1:210] 82 72 68 95 89 97 74 81 74 74 ...
##  $ L         : int [1:210] 80 90 93 67 74 64 88 81 88 88 ...
##  $ WPct      : num [1:210] 0.506 0.444 0.422 0.586 0.546 ...
##  $ attendance: int [1:210] 2509924 2532834 1950075 3048250 2500648 3300200 2058632 2169760 2650218 3202645 ...
##  $ normAttend: num [1:210] 0.584 0.589 0.454 0.709 0.582 ...
##  $ payroll   : int [1:210] 66202712 102365683 67196246 133390035 121189332 118345833 74117695 78970066 68655500 137685196 ...
##  $ metroPop  : num [1:210] 4489109 5614323 2785874 4732161 9554598 ...
##  $ name      : chr [1:210] "Arizona Diamondbacks" "Atlanta Braves" "Baltimore Orioles" "Boston Red Sox" ...
ggplot(MLB_teams, aes(x = payroll/1000000, y = WPct*100)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'Relationship Between Payroll & Winning Percentage',
       x = 'Payroll ($ in Millions)', 
       y = 'Winning Percentage (%)') +
  theme(plot.title = element_text(hjust = 0.5))
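The line drawn by `geom_smooth(method = 'lm')` is an ordinary least-squares fit, so fitting the same model with `lm()` gives the slope in the units of the plot axes. The sketch below assumes a small hypothetical data frame `toy` (invented values, not taken from `MLB_teams`) and applies the same rescaling as the plot.

```r
# Hypothetical payroll (dollars) and winning fraction for five teams
toy <- data.frame(
  payroll = c(50e6, 80e6, 100e6, 150e6, 200e6),
  WPct    = c(0.45, 0.48, 0.50, 0.54, 0.56)
)

# Same rescaling as the plot: payroll in $ millions, winning percentage in %
toy$payroll_m <- toy$payroll / 1e6
toy$wpct_pct  <- toy$WPct * 100

# Ordinary least squares, the model that geom_smooth(method = "lm") draws
fit   <- lm(wpct_pct ~ payroll_m, data = toy)
slope <- unname(coef(fit)["payroll_m"])

# Percentage points of winning percentage per additional $10 million of payroll
slope_per_10m <- slope * 10
print(round(slope_per_10m, 2))
```

Reporting the slope per $10 million rather than per dollar is only a readability choice; the fitted line is the same.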

  3. The RailTrail data set from the mosaicData package describes the usage of a rail trail in Western Massachusetts. Use these data to answer the following questions.
     a. Create a scatterplot of the number of crossings per day (volume) against the high temperature that day (hightemp).
     b. Separate your plot into facets by weekday (an indicator of weekend/holiday vs. weekday).
     c. Add regression lines to the two facets.
library(mosaicData)

data3 <- RailTrail
summary(data3)
##     hightemp        lowtemp         avgtemp          spring      
##  Min.   :41.00   Min.   :19.00   Min.   :33.00   Min.   :0.0000  
##  1st Qu.:59.25   1st Qu.:38.00   1st Qu.:48.62   1st Qu.:0.0000  
##  Median :69.50   Median :44.50   Median :55.25   Median :1.0000  
##  Mean   :68.83   Mean   :46.03   Mean   :57.43   Mean   :0.5889  
##  3rd Qu.:77.75   3rd Qu.:53.75   3rd Qu.:64.50   3rd Qu.:1.0000  
##  Max.   :97.00   Max.   :72.00   Max.   :84.00   Max.   :1.0000  
##      summer            fall          cloudcover         precip       
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 3.650   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median : 6.400   Median :0.00000  
##  Mean   :0.2778   Mean   :0.1333   Mean   : 5.807   Mean   :0.09256  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.: 8.475   3rd Qu.:0.02000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :10.000   Max.   :1.49000  
##      volume       weekday          dayType         
##  Min.   :129.0   Mode :logical   Length:90         
##  1st Qu.:291.5   FALSE:28        Class :character  
##  Median :373.0   TRUE :62        Mode  :character  
##  Mean   :375.4                                     
##  3rd Qu.:451.2                                     
##  Max.   :736.0
str(data3)
## 'data.frame':    90 obs. of  11 variables:
##  $ hightemp  : int  83 73 74 95 44 69 66 66 80 79 ...
##  $ lowtemp   : int  50 49 52 61 52 54 39 38 55 45 ...
##  $ avgtemp   : num  66.5 61 63 78 48 61.5 52.5 52 67.5 62 ...
##  $ spring    : int  0 0 1 0 1 1 1 1 0 0 ...
##  $ summer    : int  1 1 0 1 0 0 0 0 1 1 ...
##  $ fall      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cloudcover: num  7.6 6.3 7.5 2.6 10 ...
##  $ precip    : num  0 0.29 0.32 0 0.14 ...
##  $ volume    : int  501 419 397 385 200 375 417 629 533 547 ...
##  $ weekday   : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
##  $ dayType   : chr  "weekday" "weekday" "weekday" "weekend" ...
#Part a
RailTrail %>%
  ggplot(aes(x = hightemp, y = volume)) +  # volume "against" temperature: volume goes on the y-axis
  geom_point() +
  geom_smooth(method = 'lm') + 
  labs(title = 'Number of Crossings VS High Temperature per Day', 
       x = 'High Temperature (Degrees F)',
       y = 'Number of Crossings per Day') +
  theme(plot.title = element_text(hjust = 0.5))

#Parts b & c
RailTrail_data = RailTrail %>%
  mutate(weekday_l = factor(weekday,
                              levels = c(TRUE, FALSE),
                              labels = c('Weekday', 'Weekend')
                              )
         )

RailTrail_data %>%
  ggplot(aes(x = hightemp, y = volume)) +  # volume "against" temperature: volume goes on the y-axis
    geom_point() +
    geom_smooth(method = 'lm') +           # one regression line per facet
    facet_wrap(~weekday_l, ncol = 2) +
    labs(title = 'Number of Crossings VS High Temperature per Day',
      x = 'High Temperature (Degrees F)',
      y = 'Number of Crossings per Day') +
   theme(plot.title = element_text(hjust = 0.5))
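When a plot is faceted, `geom_smooth(method = 'lm')` fits a separate regression in each panel. The sketch below reproduces that idea numerically by fitting one model per group. It assumes a hypothetical data frame `toy` with invented values in place of `RailTrail`, and it treats `volume` as the response, which is an assumption about the intended direction of the relationship.

```r
# Hypothetical stand-in for RailTrail (values invented for illustration)
toy <- data.frame(
  hightemp = c(50, 60, 70, 80, 55, 65, 75, 85),
  volume   = c(200, 260, 330, 380, 300, 390, 450, 540),
  weekday  = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# One least-squares fit per group, mirroring one regression line per facet
slopes <- sapply(split(toy, toy$weekday), function(d) {
  unname(coef(lm(volume ~ hightemp, data = d))["hightemp"])
})

# Named vector: additional crossings per extra degree, by weekday status
print(slopes)
```

Comparing the two slopes shows whether usage responds to temperature differently on weekdays and weekends, which is what the two facet lines display visually.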

  4. Using data from the nasaweather package, use the geom_path function to plot the path of each tropical storm in the storms data table. Use color to distinguish the storms from one another, and use faceting to plot each year in its own panel.
library(nasaweather)
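# Qualify the table explicitly: dplyr also exports a data set named storms, which can mask this one
s_data = nasaweather::storms %>% filter(type == 'Tropical Storm')

# Longitude goes on the x-axis and latitude on the y-axis, matching the axis labels below
s_data %>% ggplot(aes(x = long, y = lat)) +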
  geom_path(aes(color = name)) +
  facet_wrap(~year, nrow = 3) +
  labs(title = 'Path of Tropical Storms',
       x = 'Longitude',
       y = 'Latitude') +
  theme(plot.title = element_text(hjust = 0.5))
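`geom_path` connects observations in the order in which they appear in the data, so a storm track is only traced correctly when its rows are in chronological order. The sketch below shows one way to enforce that order with base R before plotting. The data frame `toy` is hypothetical (invented values, deliberately shuffled), with column names borrowed from the `storms` table.

```r
# Hypothetical storm track records, deliberately out of time order
toy <- data.frame(
  name  = c("B", "A", "A", "B", "A"),
  year  = c(1995, 1995, 1995, 1995, 1995),
  month = c(7, 6, 6, 7, 6),
  day   = c(2, 4, 3, 1, 3),
  hour  = c(0, 0, 12, 18, 0),
  lat   = c(21.0, 22.0, 19.3, 20.1, 17.4),
  long  = c(-61.0, -86.0, -85.7, -60.2, -84.3)
)

# Sort within each storm by observation time so the path is traced chronologically
track <- toy[order(toy$name, toy$year, toy$month, toy$day, toy$hour), ]
rownames(track) <- NULL
print(track)
```

Filtering rows (for example, by storm type) before plotting can also remove intermediate positions from a track, so it is worth checking that each storm's remaining rows still form a continuous sequence.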

  5. Use the penguins data set from the palmerpenguins package to answer the following questions.
     a. Create a scatterplot of bill_length_mm against bill_depth_mm where individual species are colored and a regression line is added to each species. Add regression lines to all of your facets. What do you observe about the association of bill depth and bill length?
     b. Repeat the same scatterplot but now separate your plot into facets by species. How would you summarize the association between bill depth and bill length?
library(palmerpenguins)

data5 <- penguins
summary(data5)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
str(data5)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
#Part a
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species, fill = species)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'Bill Length VS Bill Depth (By Penguin Species)',
       x = 'Bill Length (mm)',
       y = 'Bill Depth (mm)') +
   theme(plot.title = element_text(hjust = 0.5))

# Observation: Within each species, bill depth increases with bill length. All three regression lines slope upward, which indicates a positive linear association, and the points for each species cluster around that species' line. The pattern holds for all three species shown in the graph.


#Part b
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species, fill = species)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  facet_wrap(~species) +
  labs(title = 'Bill Length VS Bill Depth (By Penguin Species)',
       x = 'Bill Length (mm)',
       y = 'Bill Depth (mm)') +
   theme(plot.title = element_text(hjust = 0.5))

# Observation: Faceting by species shows that the relationship between bill length and bill depth is similar across species (comparable positive slopes), while the ranges of the two variables differ (the colours make the groupings clear). Bill lengths for the "Chinstrap" and "Gentoo" species fall between 40 and 60 mm, whereas those for the "Adelie" species fall between 30 and 46 mm. The highest bill depth is observed in the "Adelie" species and the lowest in the "Gentoo" species.
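Coloring or faceting by species matters because a trend fitted to all observations at once can differ, even in sign, from the trends within each group (often called Simpson's paradox). The sketch below illustrates the check with a hypothetical data frame `toy`; its values and group labels are invented and are not penguin measurements.

```r
# Hypothetical measurements for two groups (values invented for illustration)
toy <- data.frame(
  species = rep(c("short_deep", "long_shallow"), each = 4),
  length  = c(36, 38, 40, 42, 44, 46, 48, 50),
  depth   = c(17.5, 18.0, 18.5, 19.0, 14.0, 14.5, 15.0, 15.5)
)

# Association when the grouping variable is ignored
pooled <- cor(toy$length, toy$depth)

# Association within each group
within <- sapply(split(toy, toy$species), function(d) cor(d$length, d$depth))

print(round(pooled, 2))
print(round(within, 2))
```

The same two calls could be applied to `penguins` after dropping rows with missing bill measurements, to confirm whether the pooled association agrees with the per-species lines in the plots above.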