library(nycflights13)
library(ggplot2)
library(moderndive)
(LC2.1) Take a look at both the flights data
frame from the nycflights13 package and the
alaska_flights data frame from the moderndive
package by running View(flights) and
View(alaska_flights). In what respect do these data frames
differ? For example, think about the number of rows in each
dataset.
dim(flights)
## [1] 336776 19
dim(alaska_flights)
## [1] 714 19
We can see that flights has over 300K rows, compared to
only ~700 in alaska_flights.
geom_point
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
(LC2.2) What are some practical reasons why
dep_delay and arr_delay have a positive
relationship?
A plane leaving late will most likely arrive late with a similar
delay
(LC2.3) What variables in the weather data frame
would you expect to have a negative correlation (i.e., a negative
relationship) with dep_delay? Why? Remember that we are
focusing on numerical variables here. Hint: Explore the
weather dataset by using the View()
function.
head(weather)
## # A tibble: 6 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## # … with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
I would expect pressure to have a negative correlation,
as higher pressure generally means tamer weather conditions. Planes would
be less likely to be delayed if the weather is good.
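This guess can be checked by attaching the hourly weather reading to each flight and computing the correlation. The sketch below assumes the dplyr package is installed (it is not loaded earlier in this document) and joins on origin and time_hour, which both data frames contain.

```r
library(nycflights13)
library(dplyr)

# Attach the weather reading for the scheduled departure hour to each flight
flights_weather <- flights %>%
  inner_join(weather, by = c("origin", "time_hour"))

# Correlation between pressure and departure delay;
# "complete.obs" drops rows where either value is missing
pressure_cor <- cor(flights_weather$pressure, flights_weather$dep_delay,
                    use = "complete.obs")
pressure_cor
```

A value below zero would support the guess; a value close to zero would suggest that pressure alone explains little of the delay.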
(LC2.4) Why do you believe there is a cluster of points near
(0, 0)? What does (0, 0) correspond to in terms of the Alaska Air
flights?
(0,0) corresponds to no delay on either end. There being a cluster here
would make sense as we would expect the majority of flights to stay on
time.
(LC2.5) What are some other features of the plot that stand
out to you?
I notice that the slope is a bit less than 1. This tells us that planes
that leave late tend to make up some of that time back in the air.
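The slope can be estimated rather than eyeballed. A minimal sketch using lm() from base R; whether the estimate actually comes out below 1 is something to check rather than assume.

```r
library(moderndive)

# Least-squares line of arrival delay on departure delay;
# lm() drops rows with missing values by default
delay_fit <- lm(arr_delay ~ dep_delay, data = alaska_flights)

# Estimated change in arrival delay per extra minute of departure delay
slope <- coef(delay_fit)[["dep_delay"]]
slope
```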
(LC2.6) Create a new scatterplot using different variables in
the alaska_flights data frame by modifying the example
given.
ggplot(data = alaska_flights, mapping = aes(x = air_time, y = arr_delay)) +
geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.2)
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_jitter(width = 30, height = 30)
## Warning: Removed 5 rows containing missing values (geom_point).
(LC2.7) Why is setting the alpha argument value useful with
scatterplots? What further information does it give you that a regular
scatterplot cannot?
It allows us to see when points overlap, showing density. Points could
be hidden behind other points in a regular scatterplot.
(LC2.8) After viewing Figure 2.4, give an approximate range
of arrival delays and departure delays that occur most frequently. How
has that region changed compared to when you observed the same plot
without alpha = 0.2 set in Figure 2.2?
X: (-25, 0), Y: (-50, 0) is where the majority of dots are concentrated.
It hasn’t really changed all that much. It is just clearer that the density
is significantly greater in this region.
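To put a number on "the majority", the share of flights inside that region can be computed directly. This sketch uses the bounds read off the plot above, which are approximate.

```r
library(moderndive)

# Flag flights whose delays fall inside the region read off the plot
in_region <- with(alaska_flights,
                  dep_delay >= -25 & dep_delay <= 0 &
                  arr_delay >= -50 & arr_delay <= 0)

# Proportion of flagged flights; rows where the flag is NA are dropped
share_in_region <- mean(in_region, na.rm = TRUE)
share_in_region
```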
(LC2.9) Take a look at both the weather data frame from the
nycflights13 package and the early_january_weather data frame from the
moderndive package by running View(weather) and
View(early_january_weather). In what respect do these data frames
differ?
dim(weather)
## [1] 26115 15
dim(early_january_weather)
## [1] 358 15
We can see that weather has over 26K rows, compared to
only ~350 in early_january_weather.
(LC2.10) View() the flights data frame again. Why does the time_hour variable uniquely identify the hour of the measurement, whereas the hour variable does not?
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
## # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
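The time_hour column holds a full date and time stamp
(the scheduled departure date and hour of the flight), so each value
points to one specific hour in 2013. The hour column holds only the hour
of the day of the scheduled departure, so the same value recurs on every
day of the year and cannot identify a particular hour on its own.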
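A quick count of distinct values shows how the two columns differ: hour is an hour of the day, so it can take at most 24 values, while time_hour also carries the date.

```r
library(nycflights13)

# hour only records the hour of day, so its values repeat every day
n_hour <- length(unique(flights$hour))

# time_hour also carries the date, so there are far more distinct values
n_time_hour <- length(unique(flights$time_hour))

c(hour = n_hour, time_hour = n_time_hour)
```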
geom_line
ggplot(data = early_january_weather,
mapping = aes(x = time_hour, y = temp)) +
geom_line()
(LC2.11) Why should linegraphs be avoided when there is not a
clear ordering of the horizontal axis?
Line graphs really only make sense if the information in them is
ordered.
(LC2.12) Why are linegraphs frequently used when time is the
explanatory variable on the x-axis?
It is very useful to see a ‘change over time’ line.
(LC2.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013.
ggplot(data = early_january_weather,
mapping = aes(x = time_hour, y = wind_speed)) +
geom_line()
geom_histogram
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(color = "white", fill = "steelblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(bins = 40, color = "white")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 10, color = "white")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
(LC2.14) What does changing the number of bins from 30 to 40
tell us about the distribution of temperatures?
We see that the distribution isn’t as smooth as it looked with 30 bins:
there are spikes at certain temperatures.
(LC2.15) Would you classify the distribution of temperatures
as symmetric or skewed in one direction or another?
I would say it is relatively symmetrical.
(LC2.16) What would you guess is the “center” value in this
distribution? Why did you make that choice?
Around 55. I chose that because it is right in the middle of a
symmetrical distribution.
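The visual guess can be compared with two common measures of center. In a roughly symmetric distribution the mean and the median should land close to each other.

```r
library(nycflights13)

# na.rm = TRUE skips the one missing temperature reading
temp_mean <- mean(weather$temp, na.rm = TRUE)
temp_median <- median(weather$temp, na.rm = TRUE)

c(mean = temp_mean, median = temp_median)
```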
(LC2.17) Is this data spread out greatly from the center or
is it close? Why?
It is close, but not super close. There is a large range from 30-80
where there is most of the data. But there really isn’t more at the
middle of 55 than at 35 or 75.
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(~ month)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(~ month, nrow = 4)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
(LC2.18) What other things do you notice about this faceted
plot? How does a faceted plot help us see relationships between two
variables?
We can see the average temperature change throughout the year.
A faceted plot can add in another variable into our analysis, showing
the change in the distribution of x over
y.
(LC2.19) What do the numbers 1-12 correspond to in the plot?
What about 25, 50, 75, 100?
1-12 represent the 12 months of the year (1 is January, 2 is
February, etc.). 25, 50, 75, and 100 represent the temperature of the
datapoints.
(LC2.20) For which types of datasets would faceted plots not
work well in comparing relationships between variables? Give an example
describing the nature of these variables and other important
characteristics.
Datasets where both variables are numerical. Faceted plots
require one categorical variable to divide the plots of the other
numerical variable into.
(LC2.21) Does the temp variable in the
weather dataset have a lot of variability? Why do you say
that?
Yes, the distributions look considerably different throughout
the year.
geom_boxplot
ggplot(data = weather, mapping = aes(x = month, y = temp)) +
geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
(LC2.22) What does the dot at the bottom of the plot for May correspond
to? Explain what might have occurred in May to produce this point.
A very low outlier. Maybe a random snow day happened? Not
really sure.
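Rather than guessing, the observation behind that dot can be pulled out of the data. A minimal sketch in base R that finds the coldest May reading:

```r
library(nycflights13)

# Keep only the May readings
may_weather <- weather[weather$month == 5, ]

# which.min() skips missing values and returns the first minimum
coldest <- may_weather[which.min(may_weather$temp),
                       c("origin", "day", "hour", "temp")]
coldest
```

Comparing this reading with the hours immediately before and after it should indicate whether it reflects real weather or a recording error.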
(LC2.23) Which months have the highest variability in
temperature? What reasons can you give for this?
January, November and December. Their boxes are the
tallest.
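The height of each box is the interquartile range (IQR), so the visual ranking can be checked by computing the IQR of temp for every month.

```r
library(nycflights13)

# IQR of temperature within each month (names are the month numbers)
temp_iqr <- tapply(weather$temp, weather$month, IQR, na.rm = TRUE)

# Months with the widest boxes come first
sort(temp_iqr, decreasing = TRUE)
```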
(LC2.24) We looked at the distribution of the numerical
variable temp split by the numerical variable
month that we converted using the factor()
function in order to make a side-by-side boxplot. Why would a boxplot of
temp split by the numerical variable pressure
similarly converted to a categorical variable using the
factor() not be informative?
Pressure is a continuous numerical variable, not a categorical one.
factor() would turn each individual value of pressure into its own box,
which could mean hundreds of boxes, each with very little data.
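Counting the distinct values shows how many groups factor() would create. Compare the result with the 12 groups that month produces.

```r
library(nycflights13)

# Number of distinct non-missing pressure readings,
# i.e. the number of groups that factor(pressure) would create
n_pressure <- length(unique(na.omit(weather$pressure)))
n_pressure
```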
(LC2.25) Boxplots provide a simple way to identify outliers.
Why may outliers be easier to identify when looking at a boxplot instead
of a faceted histogram?
Boxplots set a clear definition for outliers (more than 1.5x
IQR). They also visually highlight them very clearly.
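The 1.5 x IQR rule is easy to apply by hand. This sketch computes the fences for the May temperatures and lists the readings that fall outside them; it uses the default quantile() method, so treat it as an approximation of what geom_boxplot() draws.

```r
library(nycflights13)

may_temp <- weather$temp[weather$month == 5]

# First and third quartiles, and the distance between them
q <- quantile(may_temp, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Points beyond these fences are drawn as individual dots
fence_low <- q[[1]] - 1.5 * iqr
fence_high <- q[[2]] + 1.5 * iqr

outliers <- may_temp[!is.na(may_temp) &
                     (may_temp < fence_low | may_temp > fence_high)]
outliers
```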
ggplot(data = flights, mapping = aes(x = carrier)) +
geom_bar()
(LC2.26) Why are histograms inappropriate for categorical
variables?
In a histogram you are measuring a numerical variable, not a
categorical one.
(LC2.27) What is the difference between histograms and
barplots?
Histograms measure a numerical variable’s frequency, where
barplots measure a categorical variable’s frequency.
(LC2.28) How many Envoy Air flights departed NYC in
2013?
~26,000. Envoy Air is carrier code MQ in the airlines data frame.
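Bar heights are hard to read precisely, and the plot shows carrier codes rather than airline names, so it is safer to look up the code and count the flights directly. The airlines data frame ships with nycflights13.

```r
library(nycflights13)

# Look up the two-letter code for Envoy Air in the airlines table
envoy_code <- airlines$carrier[grepl("Envoy", airlines$name)]

# Count the flights operated by that carrier
envoy_flights <- sum(flights$carrier == envoy_code)

c(code = envoy_code, flights = envoy_flights)
```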
(LC2.29) What was the 7th highest airline for departed
flights from NYC in 2013? How could we better present the table to get
this answer quickly?
“US”. We could sort the bars by value.
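Sorting can be done on the counts themselves, and the same ordering can be passed to the plot through the factor levels. A minimal sketch in base R; flights_sorted is a name introduced here for illustration.

```r
library(nycflights13)
library(ggplot2)

# Flight counts per carrier, largest first
carrier_counts <- sort(table(flights$carrier), decreasing = TRUE)
seventh <- names(carrier_counts)[7]
seventh

# Order the factor levels by count so the bars are drawn tallest first
flights_sorted <- flights
flights_sorted$carrier <- factor(flights_sorted$carrier,
                                 levels = names(carrier_counts))

ggplot(data = flights_sorted, mapping = aes(x = carrier)) +
  geom_bar()
```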
(LC2.30) Why should pie charts be avoided and replaced by
barplots?
Humans are really bad at accurately comparing angles and areas, which is
what reading a pie chart requires. Bar lengths are much easier to compare.
(LC2.31) Why do you think people continue to use pie
charts?
They look pretty.
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
geom_bar()
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
geom_bar(position = position_dodge(preserve = "single"))
ggplot(data = flights, mapping = aes(x = carrier)) +
geom_bar() +
facet_wrap(~ origin, ncol = 1)
(LC2.32) What kinds of questions are not easily answered by
looking at Figure 2.23?
Comparisons over airports between carriers. This is because
they are not aligned, so it is very hard to compare.
(LC2.33) What can you say, if anything, about the
relationship between airline and airport in NYC in 2013 in regards to
the number of departing flights?
The vast majority of flights out of EWR were through EV and
UA.
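The same relationship can be read from a two-way table of counts, which is the data behind the stacked, dodged, and faceted barplots.

```r
library(nycflights13)

# Number of departing flights for each carrier (rows) and airport (columns)
carrier_by_origin <- table(flights$carrier, flights$origin)
carrier_by_origin

# Share of each airport's departures that belongs to each carrier
round(prop.table(carrier_by_origin, margin = 2), 2)
```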
(LC2.34) Why might the side-by-side barplot be preferable to
a stacked barplot in this case?
It is easier to compare the bars.
(LC2.35) What are the disadvantages of using a dodged
barplot, in general?
The total value of all the bars for a category is not able to
be visualized.
(LC2.36) Why is the faceted barplot preferred to the
side-by-side and stacked barplots in this case?
Because the charts line up, it is very easy to compare by
looking at the same carrier across the other charts above and below.
(LC2.37) What information about the different carriers at
different airports is more easily seen in the faceted barplot?
The number of carriers who operate in each airport.
| | Named graph | Shows | Geometric object | Notes |
|---|---|---|---|---|
| 1 | Scatterplot | Relationship between 2 numerical variables | geom_point() | |
| 2 | Linegraph | Relationship between 2 numerical variables | geom_line() | Used when there is a sequential order to x-variable, e.g., time |
| 3 | Histogram | Distribution of 1 numerical variable | geom_histogram() | Faceted histograms show the distribution of 1 numerical variable split by the values of another variable |
| 4 | Boxplot | Distribution of 1 numerical variable split by the values of another variable | geom_boxplot() | |
| 5 | Barplot | Distribution of 1 categorical variable | geom_bar() when counts are not pre-counted, geom_col() when counts are pre-counted | Stacked, side-by-side, and faceted barplots show the joint distribution of 2 categorical variables |
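The table distinguishes geom_bar() from geom_col(). As a sketch of the pre-counted case, the counts below are tabulated first and then drawn with geom_col(); the column name n is chosen here for illustration.

```r
library(nycflights13)
library(ggplot2)

# Pre-count the flights per carrier: one row per carrier, count in column n
carrier_counts <- as.data.frame(table(carrier = flights$carrier),
                                responseName = "n")

# geom_col() takes the bar heights from the y aesthetic instead of counting rows
ggplot(data = carrier_counts, mapping = aes(x = carrier, y = n)) +
  geom_col()
```

The result should match the geom_bar() plot of the raw flights data, since both show the same counts.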
ggplot(data = flights, mapping = aes(x = carrier)) +
geom_bar()
ggplot(flights, aes(x = carrier)) +
geom_bar()