Wk 4 Assignment: Data visualisation from the Hands-on Programming with R
title: "Untitled"
author: "Suma Pendyala"
date: "6/21/2020"
output: html_document
Chapter 7 - Exploratory Data Analysis
1 - Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
library(tidyverse)
ggplot(data = diamonds, mapping = aes(x = x)) + geom_density() + geom_rug() + labs(title = 'Distribution of x(length)')

ggplot(data = diamonds, mapping = aes(x = y)) + geom_density() + geom_rug() + labs(title = 'Distribution of y(width)')

ggplot(data = diamonds, mapping = aes(x = z)) + geom_density() + geom_rug() + labs(title = 'Distribution of z(depth)')
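
A numeric summary can complement the three density plots. The sketch below is an addition to the original answer: it tabulates the minimum, mean, and maximum of each dimension and counts zero measurements. Zeros and very large maxima would point to data-entry errors, and a smaller typical value for z is consistent with z being the depth. The object names `dims` and `zero_counts` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Minimum, mean, and maximum of each dimension (one column per variable).
dims <- sapply(
  diamonds[c("x", "y", "z")],
  function(v) c(min = min(v), mean = mean(v), max = max(v))
)
dims

# Number of physically impossible zero measurements in each dimension.
zero_counts <- sapply(diamonds[c("x", "y", "z")], function(v) sum(v == 0))
zero_counts
```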

2 - Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(data = diamonds) + geom_histogram(mapping = aes(x = price), binwidth = 20)
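
The hint asks for a wide range of bin widths, so one way to follow it is to draw the same histogram several times. This is only a sketch: the three widths below are arbitrary choices, and `price_plots` is an illustrative name.

```r
library(ggplot2)

# Bin widths to compare (arbitrary choices spanning two orders of magnitude).
binwidths <- c(10, 100, 1000)

# Build one price histogram per bin width.
price_plots <- lapply(binwidths, function(bw) {
  ggplot(diamonds, aes(x = price)) +
    geom_histogram(binwidth = bw) +
    labs(title = paste("Price, binwidth =", bw))
})

# Print each plot in turn.
for (p in price_plots) print(p)
```

Narrow bins tend to expose gaps and spikes that wide bins smooth over, so it is worth scanning the narrowest plot for empty price ranges.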

3 - How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>% filter(between(carat, .96, 1.05)) %>% group_by(carat) %>% summarize(count = n())
## # A tibble: 10 x 2
##    carat count
##    <dbl> <int>
##  1  0.96   103
##  2  0.97    59
##  3  0.98    31
##  4  0.99    23
##  5  1     1558
##  6  1.01  2242
##  7  1.02   883
##  8  1.03   523
##  9  1.04   475
## 10  1.05   361
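
To answer the question directly, the same count can be restricted to the two values of interest. This sketch narrows the filter above to 0.99 and 1 carat; according to the table above, it should report 23 and 1558 diamonds respectively. The name `carat_counts` is illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Keep only diamonds of 0.99 to 1 carat and count each distinct value.
carat_counts <- diamonds %>%
  filter(between(carat, 0.99, 1)) %>%
  count(carat)
carat_counts
```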
4 - Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
ggplot(data = diamonds) + geom_histogram(mapping = aes(x = price), binwidth = 20) + coord_cartesian(xlim = c(0,5000), ylim = c(0,700))

ggplot(data = diamonds) + geom_histogram(mapping = aes(x = price), binwidth = 20) + xlim(c(0,5000)) + ylim(c(0,700))
## Warning: Removed 14714 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing missing values (geom_bar).
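
The first warning reflects the key difference: xlim() discards observations outside the limits before the bins are computed, whereas coord_cartesian() only zooms the finished plot. As a rough check (a sketch, not part of the original answer), the number of prices outside the range 0 to 5000 can be counted directly; it should correspond to the rows reported as removed. The names `n_outside` and `n_inside` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Prices outside the x limits are turned into NA and dropped by xlim().
n_outside <- sum(diamonds$price < 0 | diamonds$price > 5000)

# Prices inside the limits are the only ones that get binned.
n_inside <- sum(diamonds$price >= 0 & diamonds$price <= 5000)

c(outside = n_outside, inside = n_inside)
```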

7.4.1 Exercises
data.frame(value = c(NA, NA, NA, rnorm(1000,0,1))) %>% ggplot() + geom_histogram(mapping = aes(x = value), bins = 50)
## Warning: Removed 3 rows containing non-finite values (stat_bin).

ggplot(data = data.frame(type = c('A','A','B','B','B',NA))) + geom_bar(mapping = aes(x = type))

2 - What does na.rm = TRUE do in mean() and sum()?
mean(c(1,2,3,NA,4), na.rm = TRUE)
## [1] 2.5
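
For contrast, the short sketch below runs both functions with and without na.rm = TRUE on the same vector. Without the argument a single NA makes the result NA; with it, the missing value is dropped before the calculation.

```r
values <- c(1, 2, 3, NA, 4)

mean(values)                # NA: the missing value propagates
sum(values)                 # NA
mean(values, na.rm = TRUE)  # 2.5: mean of 1, 2, 3, 4
sum(values, na.rm = TRUE)   # 10
```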
7.5.1.1 Exercises
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping = aes(sched_dep_time)) +
  geom_density(mapping = aes(colour = cancelled))

nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = cancelled, y = sched_dep_time))

2 - What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
diamonds %>% mutate(cut = as.numeric(cut), color = as.numeric(color), clarity = as.numeric(clarity)) %>% select(price, everything()) %>% cor()
##               price       carat         cut       color     clarity       depth
## price    1.00000000  0.92159130 -0.05349066  0.17251093 -0.14680007 -0.01064740
## carat    0.92159130  1.00000000 -0.13496702  0.29143675 -0.35284057  0.02822431
## cut     -0.05349066 -0.13496702  1.00000000 -0.02051852  0.18917474 -0.21805501
## color    0.17251093  0.29143675 -0.02051852  1.00000000  0.02563128  0.04727923
## clarity -0.14680007 -0.35284057  0.18917474  0.02563128  1.00000000 -0.06738444
## depth   -0.01064740  0.02822431 -0.21805501  0.04727923 -0.06738444  1.00000000
## table    0.12713390  0.18161755 -0.43340461  0.02646520 -0.16032684 -0.29577852
## x        0.88443516  0.97509423 -0.12556524  0.27028669 -0.37199853 -0.02528925
## y        0.86542090  0.95172220 -0.12146187  0.26358440 -0.35841962 -0.02934067
## z        0.86124944  0.95338738 -0.14932254  0.26822688 -0.36695200  0.09492388
##              table           x           y           z
## price    0.1271339  0.88443516  0.86542090  0.86124944
## carat    0.1816175  0.97509423  0.95172220  0.95338738
## cut     -0.4334046 -0.12556524 -0.12146187 -0.14932254
## color    0.0264652  0.27028669  0.26358440  0.26822688
## clarity -0.1603268 -0.37199853 -0.35841962 -0.36695200
## depth   -0.2957785 -0.02528925 -0.02934067  0.09492388
## table    1.0000000  0.19534428  0.18376015  0.15092869
## x        0.1953443  1.00000000  0.97470148  0.97077180
## y        0.1837601  0.97470148  1.00000000  0.95200572
## z        0.1509287  0.97077180  0.95200572  1.00000000
diamonds_con <- diamonds %>% mutate(cut = as.numeric(cut), color = as.numeric(color), clarity = as.numeric(clarity))
summary(lm(price ~ carat + cut + carat*cut, data = diamonds_con))
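
The relationship between carat and cut can also be inspected directly. The sketch below is an addition to the original answer: it summarises the median carat and median price within each cut and draws carat against cut as a boxplot. If lower-quality cuts tend to be larger stones, that would explain why they can cost more on average. The name `carat_by_cut` is illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Median size and price within each cut category.
carat_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(median_carat = median(carat), median_price = median(price))
carat_by_cut

# Distribution of carat for each cut.
ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()
```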
3 - Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = cancelled, y = sched_dep_time)) +
  coord_flip()

# install.packages("ggstance")  # run once if the package is not yet installed
library(ggstance)
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxploth(mapping = aes(y = cancelled, x = sched_dep_time))
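
A horizontal boxplot can also be drawn with ggplot2 alone. The sketch below assumes ggplot2 3.3.0 or later, where geom_boxplot() infers a horizontal orientation when the categorical variable is mapped to y. It uses the diamonds data instead of flights so that it runs without nycflights13; `p_horizontal` and `p_flipped` are illustrative names.

```r
library(ggplot2)

# Categorical variable on y: boxes are drawn horizontally (ggplot2 >= 3.3.0 assumed).
p_horizontal <- ggplot(diamonds) +
  geom_boxplot(aes(x = price, y = cut))

# The coord_flip() approach: map as usual, then swap the axes.
p_flipped <- ggplot(diamonds) +
  geom_boxplot(aes(x = cut, y = price)) +
  coord_flip()

print(p_horizontal)
print(p_flipped)
```

The two plots should look alike; the difference is that with coord_flip() the aesthetics are written for the vertical layout and flipped afterwards, while in the first version x and y already describe the final axes.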
4 - One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of "outlying values". One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
ggplot(data = diamonds) + geom_boxplot(mapping = aes(x = cut, y = price))

# install.packages("lvplot")  # run once if the package is not yet installed
library(lvplot)
ggplot(data = diamonds) + geom_lv(mapping = aes(x = cut, y = price))
5 - Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
diamonds %>% ggplot() + geom_histogram(mapping = aes(x = price), binwidth = 50) + facet_grid(cut~.)

diamonds %>% ggplot() + geom_violin(mapping = aes(x = cut, y = price))

diamonds %>% ggplot() + geom_freqpoly(mapping = aes(x = price, color = cut), binwidth = 50)

6 - If you have a small dataset, it's sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
ggplot(data = mpg) + geom_jitter(mapping = aes(x = drv, y = displ))

# install.packages("ggbeeswarm")  # run once if the package is not yet installed
library(ggbeeswarm)
ggplot(data = mpg) + geom_beeswarm(mapping = aes(x = drv, y = displ), priority = 'ascending')
7.5.2.1 Exercises
1 - How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
diamonds %>% count(color, cut) %>% group_by(color) %>% mutate(prop = n / sum(n)) %>% ggplot() + geom_tile(mapping = aes(x = color, y = cut, fill = prop)) + labs(title = 'Distribution of cut within color')
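
The question also asks about colour within cut. A minimal sketch of that converse, added here for comparison, groups by cut instead of color so that the proportions sum to 1 within each cut. The name `color_within_cut` is illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Proportion of each color within every cut category.
color_within_cut <- diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

ggplot(color_within_cut) +
  geom_tile(aes(x = color, y = cut, fill = prop)) +
  labs(title = "Distribution of color within cut")
```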

2 - Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
nycflights13::flights %>% group_by(dest, month) %>% summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% ggplot() + geom_tile(mapping = aes(x = month, y = dest, fill = avg_dep_delay))

nycflights13::flights %>%
  group_by(dest, month) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(dest) %>%
  mutate(n_month = n()) %>%
  ggplot() +
  geom_tile(mapping = aes(x = factor(month), y = reorder(dest, n_month), fill = avg_dep_delay)) +
  scale_fill_gradient2(low = 'yellow', mid = 'orange', high = 'red', midpoint = 35)

3 - Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
diamonds %>% count(color, cut) %>% group_by(color) %>% mutate(prop = n / sum(n)) %>% ggplot() + geom_tile(mapping = aes(x = cut, y = color, fill = prop)) + labs(title = 'Distribution of cut within color')

7.5.3.1 Exercises
1 - Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
diamonds %>% ggplot() + geom_freqpoly(mapping = aes(x = price, color = cut_width(carat, .2)), bins = 30)

diamonds %>% ggplot() + geom_freqpoly(mapping = aes(x = price, color = cut_width(carat, .4)), bins = 30)

diamonds %>% ggplot() + geom_freqpoly(mapping = aes(x = price, color = cut_number(carat, 10)), bins = 30)
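
The practical difference between the two functions shows up in the number of diamonds per bin. The sketch below is an addition to the original answer: it tabulates the bin sizes for one cut_width() and one cut_number() binning. cut_width() fixes the bin width, so counts can vary widely, while cut_number() aims for roughly equal counts with varying widths. The names `width_counts` and `number_counts` are illustrative.

```r
library(ggplot2)  # provides diamonds, cut_width(), and cut_number()

# Fixed-width bins of 0.4 carat: the number of diamonds per bin is uneven.
width_counts <- table(cut_width(diamonds$carat, 0.4))
width_counts

# Ten bins with approximately equal numbers of diamonds: the widths vary instead.
number_counts <- table(cut_number(diamonds$carat, 10))
number_counts
```

With frequency polygons this matters because sparsely populated cut_width() bins produce lines that sit near zero and are hard to compare, unless the y axis shows density instead of count.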

2 - Visualise the distribution of carat, partitioned by price.
diamonds %>% ggplot() + geom_density(mapping = aes(x = carat, color = cut_width(price, 5000, boundary = 0)))

3 - How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
diamonds %>% ggplot() + geom_boxplot(mapping = aes(x = cut_number(carat, 10), y = price)) + coord_flip()

4 - Combine two of the techniques you've learned to visualise the combined distribution of cut, carat, and price.
diamonds %>% ggplot() + geom_boxplot(mapping = aes(x = cut, y = price, color = cut_number(carat, 5)))

diamonds %>% mutate(carat_group = cut_number(carat, 10)) %>% group_by(cut, carat_group) %>% summarize(avg_price = mean(price)) %>% ggplot() + geom_tile(mapping = aes(x = cut, y = carat_group, fill = avg_price))

diamonds %>% ggplot() + geom_bin2d(mapping = aes(x = carat, y = price)) + facet_grid(cut~.)

ggplot(data = diamonds) + geom_point(mapping = aes(x = x, y = y)) + coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

ggplot(data = diamonds) + geom_bin2d(mapping = aes(x = x, y = y), bins = 800) + coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
