General rule

Please show your work and submit your code in order to receive points. Correct answers without supporting details will not receive full credit. This HW covers:

You DO NOT have to submit your HW answers using typesetting software. However, your answers must be legible for grading. Please upload your answers to the course space.

Problem 1

Please refer to the NYC flight data nycflights13 that has been discussed in the lecture notes and whose manual can be found at https://cran.r-project.org/web/packages/nycflights13/index.html. We will use flights, a tibble from nycflights13.

You are interested in the average arr_delay for 4 different months (12, 1, 7 and 8), for 3 different carriers (“UA”, “AA” and “DL”), and for distances greater than 700 miles, since you suspect that colder months and longer distances may result in longer average arrival delays. Note that you need to extract these observations from flights, and that you are required to use dplyr for this purpose.

The following tasks and questions are based on the extracted observations.

library(nycflights13)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
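# Keep only months 12, 1, 7, 8, carriers UA/AA/DL, and distances over 700 miles
data0 = flights %>%
   select(arr_delay, month, carrier, distance) %>%
   filter(month %in% c(12, 1, 7, 8), carrier %in% c("UA", "AA", "DL"), distance > 700) %>%
   mutate_if(is.character, as.factor)

# Drop rows with missing values (only arr_delay can be missing here)
data0 = na.omit(data0)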
data0
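As a quick sanity check on the extraction (a sketch only, not part of the required answer), one can redo the filter stated in the problem and confirm that exactly the requested months and carriers remain and that every distance exceeds 700 miles. The names check0 and n_by_group are hypothetical helpers introduced for illustration.

```r
library(nycflights13)
library(dplyr)

# Re-apply the filter from the problem statement on a fresh copy of flights
check0 <- flights %>%
  filter(month %in% c(12, 1, 7, 8),
         carrier %in% c("UA", "AA", "DL"),
         distance > 700) %>%
  filter(!is.na(arr_delay))

# One row per (carrier, month) combination, with the number of flights in each
n_by_group <- count(check0, carrier, month)
n_by_group

# The smallest remaining distance should be above 700 miles
min(check0$distance)
```

If a month outside the requested set shows up in n_by_group, the filter condition is the first place to look.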

(1.a) For each combination of the values of carrier and month, obtain the average arr_delay and obtain the average distance. Plot the average arr_delay against the average distance, use carrier as facet; add a title “Base plot” and center the title in the plot. This will be your base plot, say, as object p. Show the plot p.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
data1 = data0 %>%
   group_by(carrier, month) %>%
   summarise_at(vars(arr_delay, distance), list(mean))

data1
p = ggplot(data = data1, aes(x = distance, y = arr_delay)) +
   geom_point() +
   facet_wrap(~carrier, nrow = 1) +
   labs(x = 'ave_distance', y = 'ave_arr_delay', title = 'Base plot') +
   theme(plot.title = element_text(hjust = 0.5))
p

(1.b) Modify p as follows to get a plot p1: connect the points for each carrier via one type of dashed line; code the 3 levels of carrier as \(\alpha_1\), \(\beta_{1,2}\) and \(\gamma^{[0]}\), and display them in the strip texts; change the legend title into “My \(\zeta\)” (this legend is induced when you connect points for each carrier by a type of line), and put the legend in horizontal direction at the bottom of the plot; add a title “With math expressions” and center the title in the plot. Show the plot p1.

carrierstg = c(expression(alpha[1]), expression(beta['1,2']), expression(gamma^{'[0]'}))

data1$CS = factor(data1$carrier, labels = carrierstg)

# Rebuild the plot from data1 so that the new CS column is available for faceting
p1 = ggplot(data = data1, aes(x = distance, y = arr_delay)) +
   geom_point() +
   geom_line(aes(linetype = CS)) +
   # one dashed line type per carrier (the default scale would start with a solid line)
   scale_linetype_manual(values = c("dashed", "dotdash", "longdash"), labels = carrierstg) +
   facet_wrap(~CS, nrow = 1, labeller = label_parsed) +
   labs(x = 'ave_distance', y = 'ave_arr_delay',
        linetype = expression(paste("My ", zeta)),
        title = 'With math expressions') +
   theme(plot.title = element_text(hjust = 0.5),
         legend.position = "bottom",
         legend.direction = "horizontal")
p1
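label_parsed treats each strip label as a plotmath expression, so every label string has to be valid R syntax. The following base R sketch checks this with parse() before any plotting; the strings in labels_chr are illustrative stand-ins for the three carrier codes, and the check only approximates what ggplot2 does internally.

```r
# Hypothetical label strings mirroring the three carrier codes
labels_chr <- c("alpha[1]", "beta['1,2']", "gamma^'[0]'")

# Each string should parse into a single R call (subscript or superscript)
parsed <- lapply(labels_chr, function(s) parse(text = s)[[1]])
vapply(parsed, is.call, logical(1))

# A malformed label (unbalanced bracket) fails to parse; catch the error here
bad <- tryCatch(parse(text = "beta[1,2"), error = function(e) NULL)
is.null(bad)
```

Quoting '1,2' and '[0]' keeps the comma and the brackets as literal text instead of R syntax.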

(1.c) Modify p1 as follows to get a plot p2: set the font size of strip text to be 12 and rotate the strip texts counterclockwise by 15 degrees; set the font size of the x-axis text to be 10 and rotate the x-axis text clockwise by 30 degrees; set the x-axis label as “\(\hat{\mu}\) for mean arrival delay”; add a title “With font and text adjustments” and center the title in the plot. Show the plot p2.

# A negative angle rotates clockwise, a positive angle counterclockwise
p2 = p1 +
   theme(axis.text.x = element_text(size = 10, angle = -30),
         strip.text = element_text(size = 12, angle = 15)) +
   labs(x = expression(paste(hat(mu), ' for mean arrival delay')),
        title = 'With font and text adjustments')
   

p2

Problem 2

This problem requires you to visualize the binary relationship between members of a karate club as an undirected graph. Please install the R library igraphdata, from which you can obtain the data set karate and work on it. Create a graph for karate. Once you obtain the graph, you will see that each vertex is annotated by a number or letter. What do the numbers or letters refer to? Do you see subgraphs of the graph? If so, what do these subgraphs mean?

Answer: The numbers label the individual members of the karate club; ‘A’ stands for the club president John A. and ‘H’ for the karate instructor Mr. Hi (both pseudonyms). The graph shows two subgraphs, which correspond to the two factions of the club. Zachary studied conflict and fission in this network: after long disputes between the two factions, one led by John A. and the other by Mr. Hi, the karate club split into two separate clubs.

The ‘Faction’ vertex attribute gives the faction memberships of the actors. After the split of the club, club members chose their new clubs based on their factions, except actor no. 9, who was in John A.’s faction but chose Mr. Hi’s club.

#install.packages('igraphdata')
#install.packages('igraph')
library(igraphdata)
## Warning: package 'igraphdata' was built under R version 4.1.2
library(igraph)
## Warning: package 'igraph' was built under R version 4.1.2
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
data(karate)

plot(karate)
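To make the two subgraphs easier to see, the Faction vertex attribute mentioned in the answer can be inspected and mapped to the vertex color. This is a minimal sketch; it assumes the installed versions of igraph and igraphdata are compatible with each other, and the plot title is an arbitrary choice.

```r
library(igraph)
library(igraphdata)
data(karate)

# List the vertex attributes stored with the graph
vertex_attr_names(karate)

# Number of members in each faction
table(V(karate)$Faction)

# Color each vertex by its faction
plot(karate, vertex.color = V(karate)$Faction, main = "Karate club by faction")
```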

Problem 3

This problem requires you to create an interactive plot using plotly. To display the plot properly in your HW answers, you may need to compile your R code to an html file (instead of a doc, docx or pdf file).

Please use the mpg data set we have discussed in the lectures. Create an interactive scatter plot between “highway miles per gallon” hwy (on the y-axis) and “engine displacement in litres” displ (on the x-axis) with the color aesthetic designated by “number of cylinders” cyl, and set the x-axis label as “engine displacement in litres” and the y-axis label as “highway miles per gallon”. You need to check the object type for cyl and set it correctly when creating the plot. Add the title “# of cylinders” to the legend and adjust the vertical position of the legend, if you can. For the last, you may look through https://plotly.com/r/legend/ for help.

#install.packages('plotly')

library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:igraph':
## 
##     groups
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# cyl is stored as an integer; convert it to a factor so it is treated as a discrete variable
mpg$cyl = as.factor(mpg$cyl)
# mode = "markers" is set explicitly, and the legend is centered vertically
plot_ly(mpg, x = ~displ, y = ~hwy, color = ~cyl, type = "scatter", mode = "markers",
        width = 700, height = 400) %>%
   layout(xaxis = list(title = "engine displacement in litres"),
          yaxis = list(title = "highway miles per gallon"),
          legend = list(title = list(text = "# of cylinders"), y = 0.5, yanchor = "middle"))
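The type check on cyl asked for in the problem can be done directly on the packaged data. The sketch below reads ggplot2::mpg so that it does not depend on any modified copy of mpg in the workspace; cyl_factor is a hypothetical name used for illustration.

```r
library(ggplot2)

# cyl is stored as an integer in the packaged data set
class(ggplot2::mpg$cyl)

# After conversion, each distinct cylinder count becomes one factor level
cyl_factor <- factor(ggplot2::mpg$cyl)
levels(cyl_factor)
```

With an integer cyl, plotly would treat the color mapping as continuous and draw a color bar, so the conversion to a factor is what produces one legend entry per cylinder count.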