Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. We begin by studying some of the theoretical aspects of visualization. To do that we must appreciate the basic steps in the process of making a visualization.

The objective of this assignment is to complete and explain basic plots before moving on to more complicated ways to graph data.

Each question is worth 5 points.

To submit this homework, create the document in RStudio using the knitr package (the Knit button is included in RStudio), then publish the document to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see both the visualization and the code required to create it.

Questions

  1. Using data from the nasaweather package, create a scatter plot between wind and pressure, with color being used to distinguish the type of storm.
library(ggplot2)

library(nasaweather)

library(reshape)

cleanup = theme(panel.background = element_blank(), 
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                axis.line.x = element_line(colour = "black"),
                axis.line.y = element_line(colour = 'black'),
                legend.key = element_rect(colour = "white"),
                text = element_text(size = 12))

data(storms, package = "nasaweather")

str(storms)
## tibble [2,747 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name    : chr [1:2747] "Allison" "Allison" "Allison" "Allison" ...
##  $ year    : int [1:2747] 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ month   : int [1:2747] 6 6 6 6 6 6 6 6 6 6 ...
##  $ day     : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
##  $ hour    : int [1:2747] 0 6 12 18 0 6 12 18 0 6 ...
##  $ lat     : num [1:2747] 17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
##  $ long    : num [1:2747] -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
##  $ pressure: int [1:2747] 1005 1004 1003 1001 997 995 987 988 988 990 ...
##  $ wind    : int [1:2747] 30 30 35 40 50 60 65 65 65 60 ...
##  $ type    : chr [1:2747] "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
##  $ seasday : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
## Dataframe for Scatter Plot:

Nasa_weather_data<- storms[,-c(1,2,3,4,5,6,7,11)]

str(Nasa_weather_data)
## tibble [2,747 × 3] (S3: tbl_df/tbl/data.frame)
##  $ pressure: int [1:2747] 1005 1004 1003 1001 997 995 987 988 988 990 ...
##  $ wind    : int [1:2747] 30 30 35 40 50 60 65 65 65 60 ...
##  $ type    : chr [1:2747] "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
## Scatter Plot Data Object layer:

Nasa_Weather_Scatter_Plot<-ggplot(Nasa_weather_data, 
                                   aes(x=wind, y=pressure, color =type ))

# Addition of subsequent layers to the grouped scatter object layer:

Nasa_Weather_Plot <- Nasa_Weather_Scatter_Plot + geom_point() +
  labs(title = "Wind Speed and Pressure by Storm Type", x = "Wind speed (knots)", y = "Pressure (millibars)") +
  coord_cartesian(ylim = c(900, 1050), xlim = c(0, 170)) +
  scale_colour_discrete("Storm Types") + cleanup

## Saving the plot image:

ggsave("Nasa_Weather_Plot.png", plot = Nasa_Weather_Plot)

knitr::include_graphics("Nasa_Weather_Plot.png")
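
Dropping columns by position, as in `storms[,-c(1,2,3,4,5,6,7,11)]`, only works while the column order stays fixed. A minimal sketch of selecting the same three columns by name is shown here. The names `storms_head`, `keep_cols` and `scatter_data` are illustrative, and the rows are copied from the `str(storms)` output above rather than loaded from the package.

```r
# First rows of storms, copied from the str() output (a subset of columns).
storms_head <- data.frame(
  name     = rep("Allison", 4),
  year     = rep(1995L, 4),
  pressure = c(1005L, 1004L, 1003L, 1001L),
  wind     = c(30L, 30L, 35L, 40L),
  type     = c("Tropical Depression", "Tropical Depression",
               "Tropical Storm", "Tropical Storm"),
  stringsAsFactors = FALSE
)

# Selecting by name states the intent and survives column reordering.
keep_cols <- c("pressure", "wind", "type")
scatter_data <- storms_head[, keep_cols]

str(scatter_data)
```

The same indexing should work on the full `storms` table, since tibbles accept a character vector of column names.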

  2. Use the MLB_teams data in the mdsr package to create an informative data graphic that illustrates the relationship between winning percentage and payroll in context.
library(mdsr)

library(reshape)

cleanup = theme(panel.background = element_blank(), 
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                axis.line.x = element_line(colour = "black"),
                axis.line.y = element_line(colour = 'black'),
                legend.key = element_rect(colour = "white"),
                text = element_text(size = 11))


data(MLB_teams)

str(MLB_teams)
## tibble [210 × 11] (S3: tbl_df/tbl/data.frame)
##  $ yearID    : int [1:210] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ teamID    : chr [1:210] "ARI" "ATL" "BAL" "BOS" ...
##  $ lgID      : Factor w/ 7 levels "AA","AL","FL",..: 5 5 2 2 2 5 5 2 5 2 ...
##  $ W         : int [1:210] 82 72 68 95 89 97 74 81 74 74 ...
##  $ L         : int [1:210] 80 90 93 67 74 64 88 81 88 88 ...
##  $ WPct      : num [1:210] 0.506 0.444 0.422 0.586 0.546 ...
##  $ attendance: int [1:210] 2509924 2532834 1950075 3048250 2500648 3300200 2058632 2169760 2650218 3202645 ...
##  $ normAttend: num [1:210] 0.584 0.589 0.454 0.709 0.582 ...
##  $ payroll   : int [1:210] 66202712 102365683 67196246 133390035 121189332 118345833 74117695 78970066 68655500 137685196 ...
##  $ metroPop  : num [1:210] 4489109 5614323 2785874 4732161 9554598 ...
##  $ name      : chr [1:210] "Arizona Diamondbacks" "Atlanta Braves" "Baltimore Orioles" "Boston Red Sox" ...
## Dataframe for Scatter Plot:

MLB_data_a<- MLB_teams[,-c(1,2,3,4,5,7,8,10,11)]

str(MLB_data_a)
## tibble [210 × 2] (S3: tbl_df/tbl/data.frame)
##  $ WPct   : num [1:210] 0.506 0.444 0.422 0.586 0.546 ...
##  $ payroll: int [1:210] 66202712 102365683 67196246 133390035 121189332 118345833 74117695 78970066 68655500 137685196 ...
### Simple informative data graphic illustrating the relationship between winning percentage and payroll.


## Scatter Plot Data Object layer:

MLB_Scatter_Plot_a<-ggplot(MLB_data_a, 
                                   aes(x=WPct, y=payroll ))


# Addition of subsequent layers to the grouped scatter object layer

MLB_a <- MLB_Scatter_Plot_a + geom_point() +
  labs(title = "Winning Percentage and Payroll for MLB Teams", x = "Winning percentage", y = "Payroll (US dollars)") +
  coord_cartesian(ylim = c(18000000, 240000000), xlim = c(0.30, 0.70)) +
  geom_smooth(method = 'lm', se = TRUE) + cleanup

ggsave("MLB_a.png", plot = MLB_a)


knitr::include_graphics("MLB_a.png")

### Additional informative data graphic illustrating the relationship between winning percentage and payroll, with color distinguishing the MLB teams.


MLB_data_b<- MLB_teams[,-c(1,2,3,4,5,7,8,10)]

str(MLB_data_b)
## tibble [210 × 3] (S3: tbl_df/tbl/data.frame)
##  $ WPct   : num [1:210] 0.506 0.444 0.422 0.586 0.546 ...
##  $ payroll: int [1:210] 66202712 102365683 67196246 133390035 121189332 118345833 74117695 78970066 68655500 137685196 ...
##  $ name   : chr [1:210] "Arizona Diamondbacks" "Atlanta Braves" "Baltimore Orioles" "Boston Red Sox" ...
## Scatter Plot Data Object layer:

MLB_Scatter_Plot_b<-ggplot(MLB_data_b, 
                                   aes(x=WPct, y=payroll, color =name ))

# Addition of subsequent layers to the grouped scatter object layer

MLB_b <- MLB_Scatter_Plot_b + geom_point() +
  labs(title = "Winning Percentage and Payroll for MLB Teams, by Team", x = "Winning percentage", y = "Payroll (US dollars)") +
  coord_cartesian(ylim = c(18000000, 240000000), xlim = c(0.30, 0.70)) +
  scale_colour_discrete("MLB Team Names") + cleanup

## Saving the plot image:

ggsave("MLB_b.png", plot = MLB_b)

knitr::include_graphics("MLB_b.png")
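
Raw payroll values such as 66202712 are hard to read on an axis. The sketch below is one possible way to rescale payroll to millions of dollars before plotting, and it puts payroll on the x-axis on the assumption that payroll is the explanatory variable. The names `mlb_head` and `payroll_millions` are illustrative, and the five rows are copied from the `str(MLB_teams)` output above.

```r
# First five rows of WPct and payroll, copied from the str() output.
mlb_head <- data.frame(
  WPct    = c(0.506, 0.444, 0.422, 0.586, 0.546),
  payroll = c(66202712, 102365683, 67196246, 133390035, 121189332)
)

# Express payroll in millions so axis labels stay short.
mlb_head$payroll_millions <- mlb_head$payroll / 1e6

# Build the plot only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  mlb_sketch <- ggplot(mlb_head, aes(x = payroll_millions, y = WPct)) +
    geom_point() +
    labs(x = "Payroll (millions of US dollars)", y = "Winning percentage")
}

mlb_head
```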

  3. The RailTrail data set from the mosaicData package describes the usage of a rail trail in Western Massachusetts. Use these data to answer the following questions.
    a. Create a scatterplot of the number of crossings per day (volume) against the high temperature that day.
    b. Separate your plot into facets by weekday (an indicator of weekend/holiday vs. weekday).
    c. Add regression lines to the two facets.
library(dplyr)

library(mosaicData)

cleanup = theme(panel.background = element_blank(), 
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                axis.line.x = element_line(colour = "black"),
                axis.line.y = element_line(colour = 'black'),
                legend.key = element_rect(colour = "white"),
                text = element_text(size = 9))


## Exploration of the data set.

data(RailTrail)

str(RailTrail)
## 'data.frame':    90 obs. of  11 variables:
##  $ hightemp  : int  83 73 74 95 44 69 66 66 80 79 ...
##  $ lowtemp   : int  50 49 52 61 52 54 39 38 55 45 ...
##  $ avgtemp   : num  66.5 61 63 78 48 61.5 52.5 52 67.5 62 ...
##  $ spring    : int  0 0 1 0 1 1 1 1 0 0 ...
##  $ summer    : int  1 1 0 1 0 0 0 0 1 1 ...
##  $ fall      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cloudcover: num  7.6 6.3 7.5 2.6 10 ...
##  $ precip    : num  0 0.29 0.32 0 0.14 ...
##  $ volume    : int  501 419 397 385 200 375 417 629 533 547 ...
##  $ weekday   : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
##  $ dayType   : chr  "weekday" "weekday" "weekday" "weekend" ...
### Answer:

## Scatter plot of the number of crossings per day (`volume`) against that day's high temperature (`hightemp`), separated into facets by `weekday` (weekday vs. weekend/holiday), with a regression line in each facet.


### Plot Layout-1

RT_a <- RailTrail %>%
  mutate(Week_Day = ifelse(weekday, "Weekday", "Weekend-Holiday")) %>%
  ## Scatter plot with a regression line
  ggplot(aes(x = hightemp, y = volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Facet the plot by Week_Day, stacked in two rows:
  facet_wrap(~ Week_Day, nrow = 2) +
  labs(title = "Number of Rail Trail Crossings per Day against the Daily High Temperature",
       x = "High temperature (°F)", y = "Number of crossings per day (volume)") +
  coord_cartesian(ylim = c(0, 800), xlim = c(40, 100)) + cleanup


## Saving the plot image:

ggsave("RT_a.png", plot = RT_a)

knitr::include_graphics("RT_a.png")

### Alternate Layout-2 of the same plot:


RT_b <- RailTrail %>%
  mutate(Week_Day = ifelse(weekday, "Weekday", "Weekend-Holiday")) %>%
  ## Scatter plot with a regression line
  ggplot(aes(x = hightemp, y = volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Facet the plot by Week_Day, side by side in a single row:
  facet_wrap(~ Week_Day, nrow = 1) +
  labs(title = "Number of Rail Trail Crossings per Day against the Daily High Temperature",
       x = "High temperature (°F)", y = "Number of crossings per day (volume)") +
  coord_cartesian(ylim = c(0, 800), xlim = c(40, 100)) + cleanup


## Saving the plot image:

ggsave("RT_b.png", plot = RT_b)

knitr::include_graphics("RT_b.png")
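
The line that `geom_smooth(method = "lm")` draws in each facet is a least-squares fit of `volume` on `hightemp` for that facet's rows only. As a rough sketch, the same kind of fit can be reproduced with `lm()`. The first part uses only the ten rows shown in the `str(RailTrail)` output above, with `rail_head` as an illustrative name. The second part runs only if mosaicData is installed and fits one model per `weekday` value.

```r
# First ten rows of hightemp and volume, copied from the str() output.
rail_head <- data.frame(
  hightemp = c(83, 73, 74, 95, 44, 69, 66, 66, 80, 79),
  volume   = c(501, 419, 397, 385, 200, 375, 417, 629, 533, 547)
)

# Ordinary least-squares fit: intercept and slope of volume on hightemp.
fit_head <- lm(volume ~ hightemp, data = rail_head)
print(coef(fit_head))

# With the full data, fit one regression per facet (weekday TRUE / FALSE).
if (requireNamespace("mosaicData", quietly = TRUE)) {
  data("RailTrail", package = "mosaicData")
  fits_by_day <- lapply(
    split(RailTrail, RailTrail$weekday),
    function(d) coef(lm(volume ~ hightemp, data = d))
  )
  print(fits_by_day)
}
```

Comparing the two slopes in `fits_by_day` is one way to put numbers on what the two facets show visually.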

  4. Using data from the nasaweather package, use the geom_path function to plot the path of each tropical storm in the storms data table. Use color to distinguish the storms from one another, and use faceting to plot each year in its own panel.
cleanup = theme(panel.background = element_blank(), 
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                axis.line.x = element_line(colour = "black"),
                axis.line.y = element_line(colour = 'black'),
                legend.key = element_rect(colour = "white"),
                text = element_text(size = 10))


## Loading the `storms` data set from the nasaweather package (named explicitly because other packages also ship a `storms` data set).

data(storms, package = "nasaweather")

str(storms)
## tibble [2,747 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name    : chr [1:2747] "Allison" "Allison" "Allison" "Allison" ...
##  $ year    : int [1:2747] 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ month   : int [1:2747] 6 6 6 6 6 6 6 6 6 6 ...
##  $ day     : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
##  $ hour    : int [1:2747] 0 6 12 18 0 6 12 18 0 6 ...
##  $ lat     : num [1:2747] 17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
##  $ long    : num [1:2747] -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
##  $ pressure: int [1:2747] 1005 1004 1003 1001 997 995 987 988 988 990 ...
##  $ wind    : int [1:2747] 30 30 35 40 50 60 65 65 65 60 ...
##  $ type    : chr [1:2747] "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
##  $ seasday : int [1:2747] 3 3 3 3 4 4 4 4 5 5 ...
## Path plot of the tropical storms, with longitude on the x-axis, latitude on the y-axis, and one panel per year:

Tropical_Storm_Paths <- ggplot(storms, aes(x = long, y = lat)) +
  
  geom_path(aes(col = name)) +
  
  labs(title = "Paths of Tropical Storms by Year", 
       x = "Longitude", y = "Latitude") + facet_wrap(~year, nrow = 2) + 
  scale_colour_discrete("Tropical Storm Names") + cleanup



ggsave("Tropical_Storm_Paths.png", plot = Tropical_Storm_Paths)

knitr::include_graphics("Tropical_Storm_Paths.png")
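
`geom_path()` connects observations in the order they appear in the data, so a storm track is only drawn correctly if each storm's rows are in chronological order. The sketch below sorts the rows of one storm before plotting. It uses the ten Allison observations shown in the `str(storms)` output above, and the names `allison` and `allison_track` are illustrative. `coord_quickmap()` is an optional extra that approximates a map aspect ratio.

```r
# First ten observations of storm Allison, copied from the str() output.
allison <- data.frame(
  name = "Allison",
  day  = c(3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
  hour = c(0, 6, 12, 18, 0, 6, 12, 18, 0, 6),
  lat  = c(17.4, 18.3, 19.3, 20.6, 22, 23.3, 24.7, 26.2, 27.6, 28.5),
  long = c(-84.3, -84.9, -85.7, -85.8, -86, -86.3, -86.2, -86.2, -86.1, -85.6)
)

# Sort chronologically so the path follows the storm through time.
allison <- allison[order(allison$day, allison$hour), ]

# Build the track only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  allison_track <- ggplot(allison, aes(x = long, y = lat, group = name)) +
    geom_path() +
    coord_quickmap()
}
```

For the full table, sorting by year, month, day and hour before calling `ggplot()` would serve the same purpose.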

  5. Use the penguins data set from the palmerpenguins package to answer the following questions.
    a. Create a scatterplot of bill_length_mm against bill_depth_mm where individual species are colored and a regression line is added to each species. Add regression lines to all of your facets. What do you observe about the association of bill depth and bill length?
    b. Repeat the same scatterplot but now separate your plot into facets by species. How would you summarize the association between bill depth and bill length?
## Extracting the dataset:

library(palmerpenguins)

data(penguins, package = "palmerpenguins")

str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## Part a:

## Scatter plot of bill_length_mm against bill_depth_mm, colored by species, with a regression line for each species:

# ggplot object layer:

Penguins_Scatter_Plot_a <- ggplot(penguins, 
                                  aes(x = bill_depth_mm, y = bill_length_mm, color = species))

# Addition of subsequent layers to the grouped scatter object layer

Penguin_a <- Penguins_Scatter_Plot_a + geom_point() +
  labs(title = "Scatter Plot of Bill Length (mm) against Bill Depth (mm) for Different Penguin Species", x = "Bill Depth (mm)", y = "Bill Length (mm)") +
  coord_cartesian(ylim = c(30, 60), xlim = c(10, 25)) + 
  scale_colour_discrete("Penguin Species Name") + geom_smooth(method = "lm", se = FALSE) + cleanup


ggsave("Penguin_a.png", plot = Penguin_a)

knitr::include_graphics("Penguin_a.png")

### Part b:

## Scatter plot of bill_length_mm against bill_depth_mm, separated into facets by species:

Penguins_Scatter_Plot_b <- ggplot(penguins, 
                                  aes(x = bill_depth_mm, y = bill_length_mm, color = species))

# Addition of subsequent layers to the grouped scatter object layer

Penguin_b <- Penguins_Scatter_Plot_b + geom_point() +
  labs(title = "Scatter Plot of Bill Length (mm) against Bill Depth (mm) for Different Penguin Species", x = "Bill Depth (mm)", y = "Bill Length (mm)") +
  coord_cartesian(ylim = c(30, 60), xlim = c(11, 23)) + 
  facet_wrap(~species, nrow = 3) +
  scale_colour_discrete("Penguin Species Name") + geom_smooth(method = "lm", se = FALSE) + cleanup


ggsave("Penguin_b.png", plot = Penguin_b)

knitr::include_graphics("Penguin_b.png")

## Findings on the association between bill_length_mm and bill_depth_mm by penguin species:

# Within each species, the scatter plots show a positive association between bill length and bill depth: penguins with deeper bills tend to have longer bills.
# The bill depth ranges of the "Adelie" and "Chinstrap" species are almost identical, but their bill lengths differ.
# The "Gentoo" species has the smallest bill depths of the three species, while its bill lengths overlap to a large extent with those of the "Chinstrap" species.
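
A visual reading of the regression lines can be checked numerically. The sketch below defines a small helper, `cor_by_group()` (a hypothetical name, not part of any package), that returns the correlation for all rows pooled together and the correlation inside each group. The toy data frame is invented purely to show that the pooled and within-group signs can disagree, a pattern often called Simpson's paradox. The final block applies the helper to `penguins` only if palmerpenguins is installed.

```r
# Compare the pooled correlation with the correlation inside each group.
cor_by_group <- function(data, x, y, group) {
  cols <- c(x, y, group)
  complete <- data[complete.cases(data[, cols]), ]  # drop rows with NA
  list(
    pooled = cor(complete[[x]], complete[[y]]),
    within = sapply(split(complete, complete[[group]]),
                    function(d) cor(d[[x]], d[[y]]))
  )
}

# Toy values invented for illustration only (not penguin measurements).
toy <- data.frame(
  depth  = c(1, 2, 3, 11, 12, 13),
  length = c(10, 11, 12, 0, 1, 2),
  group  = rep(c("A", "B"), each = 3)
)
toy_result <- cor_by_group(toy, "depth", "length", "group")
print(toy_result)

# Apply the same helper to the penguins data when the package is available.
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  data("penguins", package = "palmerpenguins")
  print(cor_by_group(penguins, "bill_depth_mm", "bill_length_mm", "species"))
}
```

If the pooled and within-species values differ in sign for the penguins, that would suggest the answers to parts a and b should describe the association per species rather than for all penguins together.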