#loading xfun to allow for knitting to html

library(xfun)
## Warning: package 'xfun' was built under R version 4.4.2
## 
## Attaching package: 'xfun'
## The following object is masked from 'package:base':
## 
##     attr

General rule

Please show your work and submit your computer codes in order to get points. Providing correct answers without supporting details does not receive full credits. This HW covers:

For an assignment or project, you DO NOT have to submit your answers or reports using typesetting software. However, your answers must be well organized and well legible for grading. Please upload your answers in a document to the course space. Specifically, if you are not able to knit a .Rmd/.rmd file into an output file such as a .pdf, .doc, .docx or .html file that contains your codes, outputs from your codes, your interpretations on the outputs, and your answers in text (possibly with math expressions), please organize your codes, their outputs and your answers in a document in the format given below:

Problem or task or question ... 
Codes ...
Outputs ...
Your interpretations ...

It is absolutely not OK to just submit your codes only. This will result in a considerable loss of points on your assignments or projects.

Problem 1

Please refer to the NYC flight data nycflights13 that has been discussed in the lecture notes and whose manual can be found at https://cran.r-project.org/web/packages/nycflights13/index.html. We will use flights, a tibble from nycflights13.

#loading necessary libraries and reading in the flight data
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.4.2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
data(flights)

library(igraph)
## Warning: package 'igraph' was built under R version 4.4.2
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following object is masked from 'package:xfun':
## 
##     tree
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(igraphdata)
## Warning: package 'igraphdata' was built under R version 4.4.2
data(karate)

library(plotly)
## Warning: package 'plotly' was built under R version 4.4.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:igraph':
## 
##     groups
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
data(mpg)

You are interested in looking into the average arr_delay for 4 different month 12, 1, 7 and 8, for 3 different carrier “UA”, “AA” and “DL”, and for distance that are greater than 700 miles, since you suspect that colder months and longer distances may result in longer average arrival delays. Note that you need to extract observations from flights, and that you are required to use dplyr for this purpose.

#creating the data subset per filter specifications

flights2 <- filter(flights,
                   month %in% c(12, 1, 7, 8),
                   carrier %in% c("UA", "AA", "DL"),
                   distance > 700)

flights2 <- na.omit(flights2)

The following tasks and questions are based on the extracted observations.

#1.a For each combination of the values of carrier and month, obtain the average arr_delay and obtain the average distance. Plot the average arr_delay against the average distance, use carrier as facet; add a title “Base plot” and center the title in the plot. This will be your base plot, say, as object p. Show the plot p.

#grouping the data by carrier and month
flights_combos <- flights2 %>%
  group_by(carrier, month) %>%
#getting the averages for arrival delay and distance  
  summarise(
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    avg_distance = mean(distance, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.
#creating the plot
p <- ggplot(flights_combos, aes(x = avg_distance, y = avg_arr_delay)) +
  geom_point() +
  facet_wrap(~ carrier) +
  ggtitle("Base plot") +
  theme(plot.title = element_text(hjust = 0.5))

#printing the plot
p

For carrier AA, the average distance was fairly constant across the 4 months, while the average arrival delay was quite varied between -2.5 and 7.5. For DL a similar pattern is shown, although with more variance in the average delay, about -4 to 16. UA does have one month where the average distance was lower than 2 of the other months and the average delay was lower, however the other 3 months display a similar pattern to the other two carriers.

#1.b Modify p as follows to get a plot p1: connect the points for each carrier via one type of dashed line; code the 3 levels of carrier as \(\alpha_1\), \(\beta_{1,2}\) and \(\gamma^{[0]}\), and display them in the strip texts; change the legend title into “My \(\zeta\)” (this legend is induced when you connect points for each carrier by a type of line), and put the legend in horizontal direction at the bottom of the plot; add a title “With math expressions” and center the title in the plot. Show the plot p1.

#labeling the carriers with the new level codes
carrier_labels <- c(expression(alpha[1]), expression(beta['1,2']), expression(gamma^{'[0]'}))

flights_combos$carrier_factor = factor(flights_combos$carrier,labels = carrier_labels)

#creating the new plot
p1 <- ggplot(flights_combos, aes(x = avg_distance, y = avg_arr_delay)) +
  geom_point() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_line(aes(linetype = carrier_factor))+
  labs(linetype = expression(paste("My ", zeta, sep=""))) +
  scale_linetype_discrete(labels =carrier_labels) +
  
  facet_wrap(~ carrier_factor, labeller = as_labeller(label_parsed)) +
     labs(x='avg distance', y= 'avg arr delay' ,
     title=('Base plot with math expression'))+theme(
      legend.position = "bottom", 
     legend.direction = "horizontal")

#printing the plot
p1

The plot now shows each carrier as an expression and maps the relationship month to month with a line. The lines further support the trends seen in part a, that carriers AA and DL have similar patterns, while UA varies from the trend. UA would indicate that there is a certain average distance that lowers average arrival delay, while the other two don’t have as strong a pattern, showing more variability.

#1.c Modify p1 as follows to get a plot p2: set the font size of strip text to be 12 and rotate the strip texts counterclockwise by 15 degrees; set the font size of the x-axis text to be 10 and rotate the x-axis text clockwise by 30 degrees; set the x-axis label as “\(\hat{\mu}\) for mean arrival delay”; add a title “With front and text adjustments” and center the title in the plot. Show the plot p2

p2 = p1 +
  theme(axis.text.x =element_text(size=10,angle = 30),
        axis.title.x =element_text(size=10,angle = 30),
        strip.text=element_text(size=12,angle = -15)) +
  labs(x=expression(paste(hat(mu), ' for mean arrival delay'))
       ,title=('Base plot with math expression \n With front and text adjustments'))
   
#printing the plot
p2

Same graph as seen in 1b, but with the requested adjustments.

Problem 2

This problem requires you to visualize the binary relationship between members of a karate club as an undirected graph. Please install the R library igraphdata, from which you can obtain the data set karate and work on it. Create a graph for karate. Once you obtain the graph, you will see that each vertex is annotated by a number or letter. What do the numbers or letters refer to? Do you see subgraphs of the graph? If so, what do these subgraphs mean?

g <- plot(karate)
## This graph was created by an old(er) igraph version.
## ℹ Call `igraph::upgrade_graph()` on it to use with the current igraph version.
## For now we convert it on the fly...

g
## NULL
summary(karate)
## IGRAPH 4b458a1 UNW- 34 78 -- Zachary's karate club network
## + attr: name (g/c), Citation (g/c), Author (g/c), Faction (v/n), name
## | (v/c), label (v/c), color (v/n), weight (e/n)
vertex_attr(karate)
## $Faction
##  [1] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## 
## $name
##  [1] "Mr Hi"    "Actor 2"  "Actor 3"  "Actor 4"  "Actor 5"  "Actor 6" 
##  [7] "Actor 7"  "Actor 8"  "Actor 9"  "Actor 10" "Actor 11" "Actor 12"
## [13] "Actor 13" "Actor 14" "Actor 15" "Actor 16" "Actor 17" "Actor 18"
## [19] "Actor 19" "Actor 20" "Actor 21" "Actor 22" "Actor 23" "Actor 24"
## [25] "Actor 25" "Actor 26" "Actor 27" "Actor 28" "Actor 29" "Actor 30"
## [31] "Actor 31" "Actor 32" "Actor 33" "John A"  
## 
## $label
##  [1] "H"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "A" 
## 
## $color
##  [1] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
graph_attr(karate)
## $name
## [1] "Zachary's karate club network"
## 
## $Citation
## [1] "Wayne W. Zachary. An Information Flow Model for Conflict and Fission in Small Groups. Journal of Anthropological Research Vol. 33, No. 4 452-473"
## 
## $Author
## [1] "Wayne W. Zachary"

The summary indicates that the graph is undirected (U), named (N), weighted (W), and doesn’t contain loops or multiple edges (-). The name of the dataset is ‘Zachary’s karate club network’. The graph has 34 nodes and 78 edges.

There are 3 graph level attributes (g): name, citation, and author. All three of these attributes are character types (c).

There are 4 vertex level attributes (v): faction, name, label, and color. Faction and color are numeric (n). Name and label are character type.

There is only 1 edge level attribute (e): weight, which is numeric.

The graph shows 2 major clusters, one around A and one around H. From the vertex and graph information, the H represents ‘Mr. Hi’, the A represents ‘John A’, and each number represents an actor. The subgraphs show that some of the actors cluster around John and some cluster around Mr. Hi.

Problem 3

This problem requires to to create an interactive plot using plotly. If you want to display properly the plot in your HW answers, you may well need to set your HW document as an html file (instead of doc, docx or pdf file) when you compile your R codes.

Please use the mpg data set we have discussed in the lectures. Create an interactive, scatter plot between “highway miles per gallon” hwy (on the y-axis) and “engine displacement in litres” displ (on the x-axis) with the color aesthetic designated by “number of cylinders” cyl, and set the x-axis label as “engine displacement in litres” and y-axis label as “highway miles per gallon”. You need to check the object type for cyl and set it correctly when creating the plot. Add the title “# of cylinders” to the legend and adjust the vertical position of the legend, if you can. For the last, you may look through https://plotly.com/r/legend/ for help.

#converting cyl to a factor
mpg$cyl <- as.factor(mpg$cyl)

#creating the plot
p3 <- plot_ly(data = mpg, 
             x = ~displ, 
             y = ~hwy, 
             type = 'scatter',
             mode = 'markers',
             color = ~cyl) %>%
  layout(title = "Highway Miles per Gallon vs. Engine Displacement",
         xaxis = list(title = "Engine Displacement in Litres"),
         yaxis = list(title = "Highway Miles per Gallon"),
         legend = list(title = list(text = "# of cylinders"),
                       y = 1))

#displaying the plot
p3

The plot displays the mpg of vehicles with different cylinders and engine displacements. The colors show the cylinders of the vehicles, which indicate a trend that vehicles with higher engine displacement and cylinders have lower mpg. Looking at each of the # of cylinders individually indicates that there is not a strong relationship between mpg and engine displacement when # of cylinders is held constant.