Chapter 2 from Statistical Inference via Data Science

Importing Packages

library(nycflights13)
library(ggplot2)
library(moderndive)

2.3 5NG#1: Scatterplots

(LC2.1) Take a look at both the flights data frame from the nycflights13 package and the alaska_flights data frame from the moderndive package by running View(flights) and View(alaska_flights). In what respect do these data frames differ? For example, think about the number of rows in each dataset.

dim(flights)
## [1] 336776     19
dim(alaska_flights)
## [1] 714  19

flights has over 336,000 rows, while alaska_flights has only 714. Both data frames have the same 19 columns.
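
The size difference can be explored a little further. The sketch below assumes (this is not shown in the output above) that alaska_flights is simply the subset of flights operated by Alaska Airlines, carrier code "AS"; the names as_rows and same_size are made up for illustration.

```r
library(nycflights13)
library(moderndive)

# Carrier codes present in each data frame
unique(alaska_flights$carrier)
length(unique(flights$carrier))

# Assumption: alaska_flights is the "AS" subset of flights
as_rows <- flights[flights$carrier == "AS", ]
same_size <- nrow(as_rows) == nrow(alaska_flights)
same_size
```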

2.3.1 Scatterplots via geom_point

ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).

(LC2.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship?
A plane that leaves late will most likely arrive late by a similar amount, because the flight itself takes roughly the same time no matter when it departs.

(LC2.3) What variables in the weather data frame would you expect to have a negative correlation (i.e., a negative relationship) with dep_delay? Why? Remember that we are focusing on numerical variables here. Hint: Explore the weather dataset by using the View() function.

head(weather)
## # A tibble: 6 × 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
## 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
## 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
## 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
## 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
## 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
## 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
## # … with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

I would expect pressure to have a negative correlation with dep_delay, as higher pressure generally means calmer weather conditions. Planes are less likely to be delayed when the weather is good.
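
This expectation can be checked rather than guessed. A rough sketch, assuming that matching each flight to the weather record for the same origin and time_hour is a reasonable way to pair the two data frames (the names delays, hourly, joined and r_pressure are illustrative only):

```r
library(nycflights13)

# Keep only the columns needed for the comparison
delays <- flights[, c("origin", "time_hour", "dep_delay")]
hourly <- weather[, c("origin", "time_hour", "pressure")]

# Pair each flight with the weather at its origin airport in the same hour
joined <- merge(delays, hourly, by = c("origin", "time_hour"))

# Correlation, ignoring rows where either value is missing
r_pressure <- cor(joined$dep_delay, joined$pressure, use = "complete.obs")
r_pressure
```

A negative value would support the answer above; a value near zero would suggest pressure alone says little about delays.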

(LC2.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaska Air flights?
(0,0) corresponds to no delay on either end. There being a cluster here would make sense as we would expect the majority of flights to stay on time.

(LC2.5) What are some other features of the plot that stand out to you?
The slope of the point cloud looks a bit less than 1, which suggests that planes that leave late tend to make up some of the lost time in the air.
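
The slope was judged by eye from the plot. One way to put a number on it, sketched here with a simple linear regression (a technique not otherwise used in this chapter), is:

```r
library(moderndive)

# Least-squares line of arrival delay on departure delay;
# lm() drops rows with missing delays by default
fit <- lm(arr_delay ~ dep_delay, data = alaska_flights)
slope <- coef(fit)[["dep_delay"]]
slope

# Strength of the linear relationship
cor(alaska_flights$dep_delay, alaska_flights$arr_delay, use = "complete.obs")
```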

(LC2.6) Create a new scatterplot using different variables in the alaska_flights data frame by modifying the example given.

ggplot(data = alaska_flights, mapping = aes(x = air_time, y = arr_delay)) + 
  geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).

2.3.2 Overplotting

ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2)
## Warning: Removed 5 rows containing missing values (geom_point).

ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_jitter(width = 30, height = 30)
## Warning: Removed 5 rows containing missing values (geom_point).

(LC2.7) Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?
It allows us to see when points overlap, showing density. Points could be hidden behind other points in a regular scatterplot.

(LC2.8) After viewing Figure 2.4, give an approximate range of arrival delays and departure delays that occur most frequently. How has that region changed compared to when you observed the same plot without alpha = 0.2 set in Figure 2.2?
Most points are concentrated where the departure delay (x) is between about -25 and 0 and the arrival delay (y) is between about -50 and 0. The region has not changed much; it is just clearer that the density there is far greater than elsewhere.
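
To see how much of the data this region actually holds, the share of flights falling inside it can be computed directly. The bounds below are the approximate ones read off the plot, not exact values:

```r
library(moderndive)

# TRUE when a flight falls inside the dense region read off the plot
in_region <- with(
  alaska_flights,
  dep_delay >= -25 & dep_delay <= 0 & arr_delay >= -50 & arr_delay <= 0
)

# Share of flights (with non-missing delays) inside the region
share_in_region <- mean(in_region, na.rm = TRUE)
share_in_region
```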

2.4 5NG#2: Linegraphs

(LC2.9) Take a look at both the weather data frame from the nycflights13 package and the early_january_weather data frame from the moderndive package by running View(weather) and View(early_january_weather). In what respect do these data frames differ?

dim(weather)
## [1] 26115    15
dim(early_january_weather)
## [1] 358  15

weather has over 26,000 rows, while early_january_weather has only 358. Both data frames have the same 15 columns.
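
A guess at where the smaller data frame comes from, based on LC2.13 describing it as Newark Airport in the first 15 days of January 2013: subsetting weather the same way should give a comparable number of rows. The name ewr_early_jan is hypothetical.

```r
library(nycflights13)
library(moderndive)

# Assumed definition: Newark (EWR) observations from January 1 to 15
ewr_early_jan <- weather[
  weather$origin == "EWR" & weather$month == 1 & weather$day <= 15,
]

# Compare the row count of the subset with the packaged data frame
c(subset = nrow(ewr_early_jan), packaged = nrow(early_january_weather))
```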

(LC2.10) View() the flights data frame again. Why does the time_hour variable uniquely identify the hour of the measurement, whereas the hour variable does not?  

head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
## 1  2013     1     1      517         515       2     830     819      11 UA     
## 2  2013     1     1      533         529       4     850     830      20 UA     
## 3  2013     1     1      542         540       2     923     850      33 AA     
## 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
## 5  2013     1     1      554         600      -6     812     837     -25 DL     
## 6  2013     1     1      554         558      -4     740     728      12 UA     
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
## #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

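The time_hour column holds a full date-time stamp (the scheduled date and hour of the flight), so each value pins down one specific hour of 2013. The hour column holds only the hour of the day of the scheduled departure, so the same value recurs every day and cannot identify a unique hour on its own.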
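
The difference shows up when counting distinct values in each column. A small sketch (n_hour and n_time_hour are illustrative names):

```r
library(nycflights13)

# hour can take at most 24 distinct values; time_hour has one per date and hour
n_hour <- length(unique(flights$hour))
n_time_hour <- length(unique(flights$time_hour))
c(hour = n_hour, time_hour = n_time_hour)

# The same hour value appears on many different dates
head(unique(flights[flights$hour == 6, c("month", "day", "hour")]))
```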

2.4.1 Linegraphs via geom_line

ggplot(data = early_january_weather, 
       mapping = aes(x = time_hour, y = temp)) +
  geom_line()

(LC2.11) Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis?
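A linegraph connects each point to the next one along the x-axis, which implies a progression from one value to the next. If the x values have no natural order, those connecting lines are meaningless.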

(LC2.12) Why are linegraphs frequently used when time is the explanatory variable on the x-axis?
Time has a natural order, so connecting consecutive observations shows how the variable changes over time.

(LC2.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013.

ggplot(data = early_january_weather, 
       mapping = aes(x = time_hour, y = wind_speed)) +
  geom_line()

2.5 5NG#3: Histograms

2.5.1 Histograms via geom_histogram

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(color = "white", fill = "steelblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 40, color = "white")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 10, color = "white")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

(LC2.14) What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures?
With more bins the distribution no longer looks as smooth: there are spikes at certain temperatures.

(LC2.15) Would you classify the distribution of temperatures as symmetric or skewed in one direction or another?
I would say it is relatively symmetrical.

(LC2.16) What would you guess is the “center” value in this distribution? Why did you make that choice?
Around 55. I chose that because it is right in the middle of a symmetrical distribution.

(LC2.17) Is this data spread out greatly from the center or is it close? Why?
It is moderately spread out. Most of the data falls in a wide range from about 30 to 80, and there is not much more data in the middle around 55 than around 35 or 75.
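
The center and spread guessed from the histogram can be compared with numerical summaries. A minimal sketch using base R functions (temp_summary is an illustrative name):

```r
library(nycflights13)

# Center (mean, median) and spread (standard deviation, IQR) of temp,
# ignoring the missing value that the histogram warnings mention
temp_summary <- c(
  mean   = mean(weather$temp, na.rm = TRUE),
  median = median(weather$temp, na.rm = TRUE),
  sd     = sd(weather$temp, na.rm = TRUE),
  iqr    = IQR(weather$temp, na.rm = TRUE)
)
round(temp_summary, 1)
```

If the mean and median are close to each other, that supports reading the distribution as roughly symmetric.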

2.6 Facets

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month, nrow = 4)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

(LC2.18) What other things do you notice about this faceted plot? How does a faceted plot help us see relationships between two variables?
We can see the average temperature change throughout the year. A faceted plot brings another variable into the analysis: it shows how the distribution of one variable (here temp) changes across the values of another (here month).
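
The shift in average temperature that the facets suggest can also be tabulated. A short sketch with base R (monthly_mean is an illustrative name):

```r
library(nycflights13)

# Mean temperature for each month, 1 through 12
monthly_mean <- tapply(weather$temp, weather$month, mean, na.rm = TRUE)
round(monthly_mean, 1)
```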

(LC2.19) What do the numbers 1-12 correspond to in the plot? What about 25, 50, 75, 100?
1-12 represent the 12 months of the year (1 is January, 2 is February, etc.). 25, 50, 75, and 100 are temperature values in degrees Fahrenheit marked on the x-axis.

(LC2.20) For which types of datasets would faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.
Datasets where both variables are numerical. Faceted plots require one categorical variable to divide the plots of the other numerical variable into.

(LC2.21) Does the temp variable in the weather dataset have a lot of variability? Why do you say that?
Yes, the distributions look considerably different throughout the year.

2.7 5NG#4: Boxplots

2.7.1 Boxplots via geom_boxplot

ggplot(data = weather, mapping = aes(x = month, y = temp)) +
  geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).


(LC2.22) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.
A very low outlier. Maybe a random snow day happened? Not really sure.
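
Rather than guessing, the row behind that dot can be looked up, along with the other readings from the same airport on the same day. If the neighbouring hours are much warmer, a recording error becomes a plausible explanation too. The names may, coldest and same_day are illustrative.

```r
library(nycflights13)

# All May observations, then the single coldest one
may <- weather[weather$month == 5, ]
coldest <- may[which.min(may$temp), c("origin", "day", "hour", "temp")]
coldest

# Other readings at the same airport on the same day, for context
same_day <- may[
  may$origin == coldest$origin & may$day == coldest$day,
  c("hour", "temp")
]
same_day
```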

(LC2.23) Which months have the highest variability in temperature? What reasons can you give for this?
January, November and December. Their boxes are the tallest.
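
Box height corresponds to the interquartile range, so the visual impression can be checked by computing the IQR for each month (iqr_by_month is an illustrative name):

```r
library(nycflights13)

# Interquartile range of temp within each month, largest first
iqr_by_month <- tapply(weather$temp, weather$month, IQR, na.rm = TRUE)
sort(iqr_by_month, decreasing = TRUE)
```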

(LC2.24) We looked at the distribution of the numerical variable temp split by the numerical variable month that we converted using the factor() function in order to make a side-by-side boxplot. Why would a boxplot of temp split by the numerical variable pressure similarly converted to a categorical variable using the factor() not be informative?
Pressure is a continuous variable, not a categorical one. Converting it with factor() would give every individual pressure value its own box, which could mean hundreds of boxes, each with very little data.
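
Counting the distinct values makes the problem concrete. A quick sketch comparing pressure with month (n_pressure and n_month are illustrative names):

```r
library(nycflights13)

# Number of distinct non-missing values: each would become its own box
n_pressure <- length(unique(na.omit(weather$pressure)))
n_month <- length(unique(weather$month))
c(pressure = n_pressure, month = n_month)
```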

(LC2.25) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?
Boxplots use an explicit rule for outliers: points more than 1.5 × IQR below the first quartile or above the third quartile. They also draw those points individually, so they stand out clearly.
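
The 1.5 × IQR rule can be applied by hand. The sketch below does so for the May temperatures; it uses quantile() with its default method, which may differ very slightly from the hinges that geom_boxplot() computes, so treat the result as an approximation of what the plot flags.

```r
library(nycflights13)

may_temp <- weather$temp[weather$month == 5]

# First and third quartiles and the interquartile range
q <- quantile(may_temp, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Fences: anything outside them is drawn as an individual point
fences <- c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)
fences

# Observations flagged as outliers under this rule
flagged <- may_temp[
  !is.na(may_temp) &
    (may_temp < fences[["lower"]] | may_temp > fences[["upper"]])
]
flagged
```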

2.8 5NG#5: Barplots

ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar()

(LC2.26) Why are histograms inappropriate for categorical variables?
A histogram bins the values of a numerical variable along a continuous axis; a categorical variable has no numerical scale to bin.

(LC2.27) What is the difference between histograms and barplots?
Histograms show the frequency of a numerical variable's values, grouped into bins, whereas barplots show the frequency of each level of a categorical variable.

(LC2.28) How many Envoy Air flights departed NYC in 2013?
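About 26,000. Envoy Air's carrier code is MQ (see the airlines data frame in nycflights13); the bar near 55,000 belongs to EV, which is ExpressJet.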
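
The count can be read off exactly rather than estimated from bar heights. A sketch that looks up the carrier code in the airlines data frame from nycflights13 and then counts matching flights (envoy_code and n_envoy are illustrative names):

```r
library(nycflights13)

# Look up the two-letter code for Envoy Air
envoy_code <- airlines$carrier[grepl("Envoy", airlines$name)]
envoy_code

# Number of flights with that carrier code
n_envoy <- sum(flights$carrier == envoy_code)
n_envoy
```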

(LC2.29) What was the 7th highest airline for departed flights from NYC in 2013? How could we better present the table to get this answer quickly?
"US". We could sort the bars by the number of flights.
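
One way to do the sorting, sketched with base R: count the flights per carrier, order the counts, and use that order as the factor levels so that geom_bar() draws the bars from most to fewest flights. The names carrier_counts, flights_sorted and sorted_plot are illustrative.

```r
library(nycflights13)
library(ggplot2)

# Flights per carrier, largest first
carrier_counts <- sort(table(flights$carrier), decreasing = TRUE)
carrier_counts

# The 7th busiest carrier
names(carrier_counts)[7]

# Reorder the carrier levels so the bars appear in descending order
flights_sorted <- flights
flights_sorted$carrier <- factor(
  flights_sorted$carrier,
  levels = names(carrier_counts)
)
sorted_plot <- ggplot(data = flights_sorted, mapping = aes(x = carrier)) +
  geom_bar()
sorted_plot
```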

2.8.2 Must avoid pie charts!

(LC2.30) Why should pie charts be avoided and replaced by barplots?
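People are poor at judging angles and areas, so values are hard to compare accurately in a pie chart. Bar lengths on a common axis are much easier to compare.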

(LC2.31) Why do you think people continue to use pie charts?
They look pretty.

2.8.3 Two categorical variables

ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar()

ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar(position = position_dodge(preserve = "single"))

ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar() +
  facet_wrap(~ origin, ncol = 1)

(LC2.32) What kinds of questions are not easily answered by looking at Figure 2.23?
Comparisons between carriers at a particular airport. In a stacked barplot the segments for a given airport do not start from a common baseline, so their heights are very hard to compare.

(LC2.33) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?
The vast majority of flights out of EWR were through EV and UA.
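
The counts behind the bars can be tabulated to check statements like this one. A sketch using a two-way table (origin_by_carrier is an illustrative name):

```r
library(nycflights13)

# Flights for each carrier (rows) at each origin airport (columns)
origin_by_carrier <- table(flights$carrier, flights$origin)
origin_by_carrier

# Each carrier's share of the departures within an airport
round(prop.table(origin_by_carrier, margin = 2), 2)
```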

(LC2.34) Why might the side-by-side barplot be preferable to a stacked barplot in this case?
It is easier to compare the bars.

(LC2.35) What are the disadvantages of using a dodged barplot, in general?
The total across all the bars in a category is no longer easy to see.

(LC2.36) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?
Because the charts line up, it is very easy to compare by looking at the same carrier across the other charts above and below.

(LC2.37) What information about the different carriers at different airports is more easily seen in the faceted barplot?
The number of carriers who operate in each airport.
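
That number can also be computed directly. A short sketch counting the distinct carriers at each origin airport (n_carriers is an illustrative name):

```r
library(nycflights13)

# Number of distinct carriers operating out of each airport
n_carriers <- tapply(
  flights$carrier,
  flights$origin,
  function(x) length(unique(x))
)
n_carriers
```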

2.9 Conclusion

2.9.1 Summary table

| | Named graph | Shows | Geometric object | Notes |
|---|---|---|---|---|
| 1 | Scatterplot | Relationship between 2 numerical variables | geom_point() | |
| 2 | Linegraph | Relationship between 2 numerical variables | geom_line() | Used when there is a sequential order to x-variable, e.g., time |
| 3 | Histogram | Distribution of 1 numerical variable | geom_histogram() | Faceted histograms show the distribution of 1 numerical variable split by the values of another variable |
| 4 | Boxplot | Distribution of 1 numerical variable split by the values of another variable | geom_boxplot() | |
| 5 | Barplot | Distribution of 1 categorical variable | geom_bar() when counts are not pre-counted, geom_col() when counts are pre-counted | Stacked, side-by-side, and faceted barplots show the joint distribution of 2 categorical variables |
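
The table mentions geom_col() for pre-counted data, which is not demonstrated elsewhere in these notes. A minimal sketch: count the flights per carrier first, then plot the counts with geom_col(). The data frame carrier_counts and its Freq column come from as.data.frame(table(...)) and are illustrative, not part of the packages.

```r
library(nycflights13)
library(ggplot2)

# Pre-count the flights per carrier; columns are carrier and Freq
carrier_counts <- as.data.frame(table(carrier = flights$carrier))
head(carrier_counts)

# With pre-counted data, map the count to y and use geom_col()
counted_plot <- ggplot(
  data = carrier_counts,
  mapping = aes(x = carrier, y = Freq)
) +
  geom_col()
counted_plot
```

This should produce the same picture as geom_bar() applied to the raw flights data frame.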

2.9.2 Function argument specification

ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar()

ggplot(flights, aes(x = carrier)) +
  geom_bar()