Load the library:

library(ggplot2)

Exercise 1: The diamonds data.frame is included in the ggplot2 package. Study the relationships between carat, price and color. We know that price depends somewhat on carat, but does this dependence vary by color? Produce the plot or plots that you feel best communicate the relationship and then describe in words what you see.

str(diamonds)
## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
table(diamonds$color)
## 
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808

Since color is a factor variable with 7 levels, we can plot price against carat with each level of color shown in a different plotting color. Also, since the diamonds data frame is large, let's take a random sample of 100 records.

set.seed(123)
dsample <- diamonds[sample(nrow(diamonds), 100), ]

Plot the points from the sample.

# Only x and y are mapped here; color is mapped in geom_point() below.
cpc0 <- ggplot(data = dsample, aes(x = carat, y = price))
cpc1 <- cpc0 + geom_point(aes(color = color), size = 3)
cpc1 

Now we add a smooth (loess) curve and, separately, a linear regression line.

cpc2 <- cpc1 + geom_smooth()
cpc2 + ggtitle("Differentiable Curve")
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

cpc3 <- cpc1 + geom_smooth(method = "lm", se = FALSE)
cpc3 + ggtitle("Regression Line")

We can observe an apparent linear relationship between price and carat, but it is difficult to discern from the above graphs whether any relationship exists between price and color. Therefore, we will explore the impact of color with geom_jitter on the full diamonds data frame.

Note: We will explore the relationship color has with carat, price, and price/carat.

For Color and Carat:

cpc_carat <- ggplot(data = diamonds, aes(x = color, y = carat)) 
cpc_carat + geom_jitter(alpha = 0.1)

Adding boxplots:

cpc_carat + geom_boxplot(aes(fill = color))

From both of the above plots, we notice that there may be a relationship between color and carat: diamonds with higher carat values tend to have a color later in the alphabet (i.e., further from D and closer to J). To explore this, we can look at the density plot of carat for each color.

cpc_carat_den <- ggplot(data = diamonds, aes(x = carat))
cpc_carat_den + geom_density(aes(color = color)) + scale_x_continuous(limits = c(0,3))
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning: Removed 6 rows containing non-finite values (stat_density).
## Warning: Removed 13 rows containing non-finite values (stat_density).
## Warning: Removed 9 rows containing non-finite values (stat_density).

The density plot is consistent with our earlier observation suggesting a relationship.

Now we investigate the relationship between color and price.

For Color and Price:

cpc_price <- ggplot(data = diamonds, aes(x = color, y = price)) 
cpc_price + geom_jitter(alpha = 0.2, color = "blue")

Adding Boxplots:

cpc_price + geom_boxplot(aes(fill = color))

Similar to our observations of color and carat, we notice that higher prices are associated with the colors further away from the start of the alphabet. This is what we might expect given our observations about the relationship between color and carat. The following density plot of price should also be consistent with the density plot of carat.

cpc_price_den <- ggplot(data = diamonds, aes(x = price))
cpc_price_den + geom_density(aes(color = color)) + scale_x_continuous(limits = c(0,17000))
## Warning: Removed 49 rows containing non-finite values (stat_density).
## Warning: Removed 67 rows containing non-finite values (stat_density).
## Warning: Removed 113 rows containing non-finite values (stat_density).
## Warning: Removed 146 rows containing non-finite values (stat_density).
## Warning: Removed 165 rows containing non-finite values (stat_density).
## Warning: Removed 132 rows containing non-finite values (stat_density).
## Warning: Removed 45 rows containing non-finite values (stat_density).

Though the pattern is not as clear, the densities for colors D, E, F, and G peak at a lower price than those for colors H, I, and J, suggesting a relationship between color and price.

Since price appears to increase roughly linearly with carat, we also look at price per carat (price/carat) to see whether it shows a similar relationship with color. We plot the same graphs as above to be consistent.

For Color and Price/Carat:

cpc_pc <- ggplot(data = diamonds, aes(x = color, y = price/carat)) 
cpc_pc + geom_jitter(alpha = 0.2, color = "green")

Adding Boxplot:

cpc_pc + geom_boxplot(aes(fill = color))

Density:

cpc_pc_den <- ggplot(data = diamonds, aes(x = price/carat))
cpc_pc_den + geom_density(aes(color = color)) + scale_x_continuous(limits = c(0,12000))
## Warning: Removed 47 rows containing non-finite values (stat_density).
## Warning: Removed 19 rows containing non-finite values (stat_density).
## Warning: Removed 16 rows containing non-finite values (stat_density).
## Warning: Removed 2 rows containing non-finite values (stat_density).

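Based on all of the above plots, we observe that color is related to both carat and price: diamonds with colors later in the alphabet tend to have higher carat values and, correspondingly, higher prices.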
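
To address the original question more directly (does the dependence of price on carat vary by color?), one option is to fit a separate regression line for each color. The following is a minimal sketch and not part of the analysis above: the names slopes and p_by_color are introduced here for illustration, and a straight-line fit within each color is an assumption that may not describe the data well at large carat values.

```r
library(ggplot2)

# Slope of price on carat, fitted separately within each color level.
# Noticeably different slopes would suggest that the effect of carat
# on price varies by color.
slopes <- sapply(
  split(diamonds, diamonds$color),
  function(d) coef(lm(price ~ carat, data = d))[["carat"]]
)
print(round(slopes))

# Mapping color in the top-level aes() makes geom_smooth() fit one line
# per color instead of a single overall line.
p_by_color <- ggplot(diamonds, aes(x = carat, y = price, color = color)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_by_color
```

Lines that fan out as carat increases would indicate that the price paid per additional carat differs between colors; roughly parallel lines would indicate that it does not.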

Exercise 2: This exercise uses the Houston flight data. Here, we’ll use a version of the data set that is available in the package hflights. Install the package hflights and then execute the following.

library(hflights)
str(hflights)
## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...

Study the relationship between arrival delay (ArrDelay), arrival time (ArrTime) and day of week (DayOfWeek). Are delays more likely at certain times of day or days of week? Note: transform the data to not allow negative arrival delays.
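
The note asks that negative arrival delays (early arrivals) not be allowed. One simple way to do this, sketched here on a toy vector rather than the real data, is to clamp them to zero with pmax(). The vector arr_delay and the column name ArrDelayPos are hypothetical names used only for illustration.

```r
# Toy values standing in for hflights$ArrDelay (illustrative, not real data).
arr_delay <- c(-10, -9, 3, NA, 44)

# pmax() takes the elementwise maximum, so early arrivals become 0,
# positive delays are unchanged, and missing values stay NA.
arr_delay_pos <- pmax(arr_delay, 0)
print(arr_delay_pos)

# With the package loaded, the same idea applied to the full data would be:
# hflights$ArrDelayPos <- pmax(hflights$ArrDelay, 0)
```

Clamping keeps every flight in the data, whereas restricting an axis to non-negative values drops the early arrivals from the plot.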

For Day of the Week:

d <- ggplot(data = hflights, aes(x = factor(DayOfWeek), y = ArrDelay)) + geom_point()
d + xlab("Day of Week") + ylab("Arrival Delay")
## Warning: Removed 3622 rows containing missing values (geom_point).

Adding Boxplots:

d_box <- ggplot(data = hflights, aes(x = factor(DayOfWeek), y = ArrDelay)) + geom_boxplot()
d_box + xlab("Day of Week") + ylab("Arrival Delay")
## Warning: Removed 3622 rows containing non-finite values (stat_boxplot).

Based on the above two plots, we do not see a discernible difference in arrival delays across the days of the week. At first glance, there appears to be no relationship between arrival delay and day of the week.

Density:

d_den <- ggplot(data = hflights, aes(x = ArrDelay))
d_den + geom_density(aes(color = factor(DayOfWeek))) + scale_x_continuous(limits = c(0,250))
## Warning: Removed 16595 rows containing non-finite values (stat_density).
## Warning: Removed 16452 rows containing non-finite values (stat_density).
## Warning: Removed 17077 rows containing non-finite values (stat_density).
## Warning: Removed 16688 rows containing non-finite values (stat_density).
## Warning: Removed 17447 rows containing non-finite values (stat_density).
## Warning: Removed 13534 rows containing non-finite values (stat_density).
## Warning: Removed 16159 rows containing non-finite values (stat_density).

The density plot appears to confirm that there is no relationship between arrival delays and day of the week.
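
Because these plots are dominated by a small number of very long delays, a numeric summary can help when comparing days. The sketch below assumes the hflights package is installed and computes the share of flights that arrived late on each day of the week. Treating any ArrDelay greater than 0 as "late" is an assumption, and arrived and late_share are hypothetical names.

```r
library(hflights)

# Keep only flights with a recorded arrival delay
# (it is missing for, e.g., cancelled flights).
arrived <- hflights[!is.na(hflights$ArrDelay), ]

# Proportion of flights with a positive arrival delay, per day of week
# (1 = Monday, ..., 7 = Sunday in this data set).
late_share <- tapply(arrived$ArrDelay > 0, arrived$DayOfWeek, mean)
print(round(late_share, 3))
```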

For Arrival Time:

t_jitter <- ggplot(data = hflights, aes(x = ArrTime, y = ArrDelay)) + geom_jitter(alpha=0.1)
t_jitter + xlab("Arrival Time") + ylab("Arrival Delay")
## Warning: Removed 3622 rows containing missing values (geom_point).

t <- ggplot(data = hflights, aes(x = ArrTime, y = ArrDelay))
t + geom_point() + geom_smooth(method = "lm")
## Warning: Removed 3622 rows containing missing values (stat_smooth).
## Warning: Removed 3622 rows containing missing values (geom_point).

We notice that the regression line is flat and both of the above graphs suggest that there is no relationship between arrival times and arrival delays.

We will take a sample to declutter the graphs and confirm the lack of relationship between the variables.

set.seed(111)
fsample <- hflights[sample(nrow(hflights), 1000), ]

t_jitter_sample <- ggplot(data = fsample, aes(x = ArrTime, y = ArrDelay)) + geom_jitter()
t_jitter_sample + xlab("Arrival Time") + ylab("Arrival Delay") + ggtitle("Random Sample")
## Warning: Removed 14 rows containing missing values (geom_point).

t_sample <- ggplot(data = fsample, aes(x = ArrTime, y = ArrDelay))
t_sample + geom_point() + geom_smooth(method = "lm")
## Warning: Removed 14 rows containing missing values (stat_smooth).
## Warning: Removed 14 rows containing missing values (geom_point).

The sample confirms the appearance of no relationship between arrival delays and arrival times.
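
One caveat: ArrTime is stored as an integer in hhmm form (1500 means 15:00), so it is not a continuous scale, and a single straight line can miss patterns that rise and fall over the day. As a hedged alternative, the sketch below groups flights by arrival hour and clamps negative delays to zero. ArrHour, ArrDelayPos, flights_hr and p_hour are hypothetical names introduced here.

```r
library(ggplot2)
library(hflights)

flights_hr <- transform(
  hflights[!is.na(hflights$ArrDelay), ],
  ArrHour     = ArrTime %/% 100,   # integer division: 1454 -> 14
  ArrDelayPos = pmax(ArrDelay, 0)  # early arrivals count as zero delay
)

# One box per arrival hour, so the delay distribution can be compared
# hour by hour without assuming a linear trend.
p_hour <- ggplot(flights_hr, aes(x = factor(ArrHour), y = ArrDelayPos)) +
  geom_boxplot(outlier.alpha = 0.1) +
  xlab("Arrival Hour") +
  ylab("Arrival Delay (minutes, negatives set to 0)")
p_hour
```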

Note: We could use facet_wrap to work through both exercises. However, since each categorical variable has 7 levels, the resulting panels would be small, and patterns would be difficult to see without first combining some of the levels.
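
For reference, a faceted version of the carat and price plot from Exercise 1 might look like the following minimal sketch. The name p_facet is hypothetical, and the alpha value is an arbitrary choice to reduce overplotting.

```r
library(ggplot2)

# One panel per color level; all panels share the same axes by default,
# which keeps the carat-price pattern comparable across colors.
p_facet <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  facet_wrap(~ color)
p_facet
```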