The flights data frame contains information on all flights that departed from New York City (NYC) airports in 2013.

Scatterplot Example

Graphically investigate the relationship between the following two numerical variables in the flights data frame:

dep_delay: departure delay on the horizontal “x” axis and arr_delay: arrival delay on the vertical “y” axis

for Alaska Airlines flights leaving NYC in 2013.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 3.4.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
#filter data for Alaska Airlines only
all_alaska_flights <- flights %>% filter(carrier == "AS")
## Warning: package 'bindrcpp' was built under R version 3.4.4
#Scatterplot #1
#default scatterplot of departure delay vs. arrival delay
ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay))+
  geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).

#Scatterplot #2
#trying to relieve overplotting in scatterplot with the alpha argument
ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2)
## Warning: Removed 5 rows containing missing values (geom_point).

#Scatterplot #3
#adding jitter to the scatterplot to relieve overplotting
ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_jitter(width = 30, height = 30)
## Warning: Removed 5 rows containing missing values (geom_point).

Scatterplot Questions

  1. Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?

Scatterplot #1 shows overplotting near the (0, 0) coordinates, where many points are plotted on top of one another. Adjusting the transparency of the points with the alpha argument reveals how densely the values are packed there. The alpha argument of geom_point() ranges from 0 (100% transparent) to 1 (100% opaque). In Scatterplot #2, with alpha set to 0.2, areas where many values overlap appear darker and areas with fewer values appear lighter, which a regular scatterplot cannot show.
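The amount of overplotting can also be checked numerically. The following is a minimal sketch, assuming the same nycflights13 data as above; the name overlap_counts is made up for illustration. It counts how many flights share each exact pair of delay values, since those are the points that land on top of one another.

```r
library(dplyr)
library(nycflights13)

# Alaska Airlines flights, as in the scatterplots above
all_alaska_flights <- flights %>% filter(carrier == "AS")

# Count how many flights share each exact (dep_delay, arr_delay) pair.
# Pairs with n > 1 are drawn on top of each other in a plain scatterplot.
# overlap_counts is a hypothetical name used only in this sketch.
overlap_counts <- all_alaska_flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  count(dep_delay, arr_delay, sort = TRUE)

# The most heavily overplotted coordinates come first
head(overlap_counts)
```

Rows with a large n correspond to the darkest spots in Scatterplot #2.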

  2. After viewing Scatterplot #2, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in Scatterplot #1?

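The overall range of arrival delays and departure delays does not change when alpha is set to 0.2, because the range is the difference between the smallest and largest recorded values and transparency does not alter the data. Arrival delay range = 275 minutes, departure delay range = 290 minutes. What changes is that the most frequent values, near (0, 0), now stand out as the darkest region instead of being hidden in a solid block of points.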
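The ranges can be computed directly rather than read off the plot. This is a sketch under the assumption that the quartiles are a reasonable stand-in for "the most frequent region"; the name delay_summary and the choice of the 25th and 75th percentiles are illustrative, not part of the original assignment.

```r
library(dplyr)
library(nycflights13)

all_alaska_flights <- flights %>% filter(carrier == "AS")

# Overall range (max - min) and the middle 50% of each delay variable.
# Missing delays are dropped with na.rm = TRUE.
# delay_summary is a hypothetical name used only in this sketch.
delay_summary <- all_alaska_flights %>%
  summarize(
    dep_range = diff(range(dep_delay, na.rm = TRUE)),
    arr_range = diff(range(arr_delay, na.rm = TRUE)),
    dep_q25 = quantile(dep_delay, 0.25, na.rm = TRUE),
    dep_q75 = quantile(dep_delay, 0.75, na.rm = TRUE),
    arr_q25 = quantile(arr_delay, 0.25, na.rm = TRUE),
    arr_q75 = quantile(arr_delay, 0.75, na.rm = TRUE)
  )

delay_summary
```

The range columns describe the full extent of the data, while the quartile columns bracket the region where half of the flights fall.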

Linegraphs

Create a linegraph of hourly temperature at the Newark, NJ airport using the weather data frame. We will discuss the weather data frame in class.

library(dplyr)
library(nycflights13)
library(ggplot2)

#filter Newark temperature in first 15 days of January
early_january_weather <- weather %>% filter(origin == "EWR" & month == 1 & day <= 15)

#Linegraph of hourly temperature using geom_line()
ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) +
  geom_line()

Linegraph Question

  1. Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis?

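Linegraphs show how one variable changes as another, ordered variable (most often time) increases. Because the line connects each point to the next one along the horizontal axis, it implies that neighboring values follow one another. If the horizontal axis has no clear ordering, those connections are arbitrary and suggest trends that do not exist. In this example the temperature is recorded at uniform time intervals (hourly), so the ordering is clear and the linegraph is effective.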
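Whether the horizontal axis is ordered and evenly spaced can be checked before drawing the line. The sketch below assumes time_hour is stored as a date-time, as in current versions of nycflights13; ordered_weather and gaps are hypothetical names.

```r
library(dplyr)
library(nycflights13)

early_january_weather <- weather %>%
  filter(origin == "EWR" & month == 1 & day <= 15)

# time_hour is a date-time, so it has a natural order to sort by
ordered_weather <- early_january_weather %>% arrange(time_hour)

# Gaps between consecutive observations, converted to hours
gaps <- as.numeric(diff(ordered_weather$time_hour), units = "hours")

# Tabulate the gap sizes; a single value of 1 would mean perfectly hourly data
table(gaps)
```

If the table shows gaps larger than one hour, the line in the plot bridges missing observations at those points.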

Histograms Using geom_histogram

Using the weather data frame, construct different histograms.

library(dplyr)
library(nycflights13)
library(ggplot2)

#Histogram #1
#default histogram
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

#Histogram #2
#changing number of bins using bins argument
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 60, color = "white")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

#Histogram #3
#using color to fill in bars and outline bins
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 60, color = "white", fill = "tomato3")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

#Histogram #4
#changing number of bins using binwidth argument
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 10, color = "white")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Histogram Questions

  1. What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures?

The bins argument sets the number of groups (intervals) into which the data are classified. Changing the number of bins from 30 to 60 makes each bin narrower, so the distribution of temperatures is shown in finer detail and smaller peaks and dips that wider bins smooth over can become visible.
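The link between the number of bins and the bin width can be illustrated with a quick calculation. This is an approximation only: ggplot2 places its bin edges slightly differently, so the widths it actually uses will not match these values exactly. The names span and approx_width are made up for this sketch.

```r
library(nycflights13)

# Full span of recorded temperatures, ignoring the missing value
span <- diff(range(weather$temp, na.rm = TRUE))

# Rough bin width if the span is divided evenly into 30 or 60 bins
# (an approximation; ggplot2's exact bin edges differ slightly)
approx_width <- c(bins_30 = span / 30, bins_60 = span / 60)

approx_width
```

Doubling the number of bins halves the width of each bin, which is why Histogram #2 shows more detail than Histogram #1.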

  2. Would you classify the distribution of temperatures as symmetric or skewed?

Symmetric, as the histogram has an inverted-U or bell shape.

  3. What would you guess is the “center” value in this distribution? Why did you make that choice?

I would guess the center value of this distribution is between 60 and 70 degrees Fahrenheit. In a roughly symmetric distribution, the mean and the median both lie near the middle of the histogram, so that is where I looked for the center.
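A visual guess at the center can be checked against the data. The following sketch computes the mean and median of temp; temp_center is a hypothetical name, and na.rm = TRUE is needed because one temperature value is missing.

```r
library(dplyr)
library(nycflights13)

# Two common measures of center for the hourly temperatures
# temp_center is a hypothetical name used only in this sketch.
temp_center <- weather %>%
  summarize(
    mean_temp = mean(temp, na.rm = TRUE),
    median_temp = median(temp, na.rm = TRUE)
  )

temp_center
```

If the two values are close to each other, that supports describing the distribution as roughly symmetric.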

  4. Is this data spread out greatly from the center or is it close? Why?

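The data are spread out fairly widely around the center: the histogram bars cover most of the temperature axis rather than clustering in a narrow band. Symmetry alone says nothing about spread, since a symmetric distribution can be either narrow or wide.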
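Spread can be quantified instead of judged by eye. A minimal sketch, assuming the standard deviation and interquartile range are acceptable measures here; temp_spread is a hypothetical name.

```r
library(dplyr)
library(nycflights13)

# Standard deviation, interquartile range, and extremes of temperature
# temp_spread is a hypothetical name used only in this sketch.
temp_spread <- weather %>%
  summarize(
    sd_temp = sd(temp, na.rm = TRUE),
    iqr_temp = IQR(temp, na.rm = TRUE),
    min_temp = min(temp, na.rm = TRUE),
    max_temp = max(temp, na.rm = TRUE)
  )

temp_spread
```

Comparing sd_temp with the distance between min_temp and max_temp gives a sense of how tightly the values sit around the center.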

Faceted Histograms

We will continue using the weather data frame for these examples.

library(dplyr)
library(nycflights13)
library(ggplot2)

#Faceted Histogram
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month, nrow = 4)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Faceted Histogram Questions

  1. What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?

Faceted plots let us study the relationship between a numeric variable and a categorical variable. In this example, the distribution of temperature is plotted separately for each month in 12 panels, which makes the months easy to compare. The panels are arranged in 4 rows because nrow = 4 was set in facet_wrap().

  2. What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?

The numbers 1-12 are the months of the year, one per panel. The numbers 25, 50, 75, and 100 are the x-axis tick marks, giving the hourly temperature in degrees Fahrenheit measured at the NYC airports.

  3. For which types of data-sets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.

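Faceted plots are not suitable for studying the relationship between two numeric variables, because faceting needs a categorical variable (or one with only a few distinct values) to split the data into panels. For example, suppose I want to know how temperature changes with rainfall in the region. Both variables are numeric, so faceting by either one would produce far too many panels. This is not a good use case for this plot type.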
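For two numeric variables, a scatterplot is the more usual choice. The sketch below uses precip from the weather data frame as a stand-in for rainfall, which is an assumption about what the example above has in mind; temp_vs_precip is a hypothetical name.

```r
library(ggplot2)
library(nycflights13)

# Two numeric variables: a scatterplot rather than a faceted histogram.
# alpha = 0.2 is reused from Scatterplot #2 to reduce overplotting.
# temp_vs_precip is a hypothetical name used only in this sketch.
temp_vs_precip <- ggplot(data = weather, mapping = aes(x = precip, y = temp)) +
  geom_point(alpha = 0.2)

temp_vs_precip
```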

  4. Does the temp variable in the weather data-set have a lot of variability? Why do you say that?

The temp variable measures the hourly temperature recorded at the airports. So, there will be variation within the hourly data, and this will change from month to month as the weather conditions change over the year.
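The month-to-month variation described above can be summarized in a table that mirrors the 12 facets. This is a sketch only; monthly_temp is a hypothetical name.

```r
library(dplyr)
library(nycflights13)

# Mean and standard deviation of hourly temperature for each month,
# one row per facet of the faceted histogram
# monthly_temp is a hypothetical name used only in this sketch.
monthly_temp <- weather %>%
  group_by(month) %>%
  summarize(
    mean_temp = mean(temp, na.rm = TRUE),
    sd_temp = sd(temp, na.rm = TRUE)
  )

monthly_temp
```

Differences in mean_temp across rows show variation between months, while sd_temp shows how much the temperature varies within each month.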