DATA TRANSFORMATION WEEK 4:
7.3 VARIATIONS
7.3.4 Exercises
1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# distribution of x:
ggplot(data = diamonds, mapping = aes(x = x)) +
  geom_density() +
  geom_rug() +
  labs(title = 'Distribution of x (length)')
# distribution of y:
ggplot(data = diamonds, mapping = aes(x = y)) +
  geom_density() +
  geom_rug() +
  labs(title = 'Distribution of y (width)')
# distribution of z:
ggplot(data = diamonds, mapping = aes(x = z)) +
  geom_density() +
  geom_rug() +
  labs(title = 'Distribution of z (depth)')
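All three distributions are roughly right-skewed: small diamonds are far more common than large ones. The rug marks also show a few values far from the bulk of the data, including zeros, which cannot be real measurements. Values of z are typically smaller than values of x and y, which is consistent with z being the depth and x and y being the length and width.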
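The reading of the density plots above can be checked numerically. The following is a minimal sketch, not part of the original answer; the object names dims, zero_counts, max_values and median_values are illustrative assumptions.

```r
library(ggplot2)  # provides the diamonds data set

# Keep only the three dimension columns (all measured in mm).
dims <- diamonds[, c("x", "y", "z")]

# Number of zero entries per column; a zero cannot be a real measurement.
zero_counts <- colSums(dims == 0)

# Largest and typical value per column, to spot implausible outliers
# and to compare the usual size of z with that of x and y.
max_values <- sapply(dims, max)
median_values <- sapply(dims, median)

print(zero_counts)
print(max_values)
print(median_values)
```

Comparing the maxima with the medians gives a rough sense of how far the extreme values sit from the typical ones.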
2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 10)
Most diamonds are priced below 10000, and diamonds priced at 15000 or above are rare.
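With a narrow binwidth, a histogram of price tends to show an apparent gap near 1500. The sketch below is one way to check this by counting diamonds in three adjacent price bands of equal width. The band edges and the name band_counts are illustrative assumptions, not taken from the original answer.

```r
library(ggplot2)  # provides the diamonds data set

price <- diamonds$price

# Count diamonds in three adjacent bands, each 80 wide.
# The band edges are hypothetical and chosen only for illustration.
band_counts <- c(
  below = sum(price >= 1380 & price < 1460),
  gap   = sum(price >= 1460 & price < 1540),
  above = sum(price >= 1540 & price < 1620)
)

print(band_counts)
```

If the middle band holds far fewer diamonds than its neighbours, the gap is a property of the data and not an artifact of the chosen binwidth.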
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>%
  filter(between(carat, 0.96, 1.05)) %>%
  group_by(carat) %>%
  summarize(count = n())
## # A tibble: 10 x 2
## carat count
## <dbl> <int>
## 1 0.96 103
## 2 0.97 59
## 3 0.98 31
## 4 0.99 23
## 5 1 1558
## 6 1.01 2242
## 7 1.02 883
## 8 1.03 523
## 9 1.04 475
## 10 1.05 361
There are far fewer 0.99-carat diamonds (23) than 1-carat diamonds (1558). A plausible cause is that weights just below a whole carat tend to be rounded up to 1.
4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
# using coord_cartesian()
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 20) +
  coord_cartesian(xlim = c(0, 4000), ylim = c(0, 500))
# using xlim() and ylim()
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 20) +
  xlim(c(0, 4000)) +
  ylim(c(0, 500))
## Warning: Removed 19379 rows containing non-finite values (stat_bin).
## Warning: Removed 15 rows containing missing values (geom_bar).
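coord_cartesian() only zooms the view: the bins are computed from all the data, and bars that extend past the limits are cut off at the edge of the plot. xlim() and ylim() instead drop observations outside the limits before the bins are computed, which is why the warnings above report removed rows, and bars taller than the y limit are removed entirely. If binwidth is left unset, the default of 30 bins is computed over the full range of price with coord_cartesian() but only over the retained range with xlim().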
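The difference between the two approaches can also be seen in the binned data itself. The following is a minimal sketch using ggplot2's layer_data() to extract the computed bins; only the x limits are set to keep the comparison simple, and the names base_plot, zoomed, clipped, zoom_bins and clip_bins are illustrative assumptions.

```r
library(ggplot2)

base_plot <- ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram(binwidth = 20)

# Zoom the view only; all observations still go into the bins.
zoomed <- base_plot + coord_cartesian(xlim = c(0, 4000))

# Restrict the scale; observations outside the limits are dropped
# before binning (ggplot2 warns about the removed rows).
clipped <- base_plot + xlim(0, 4000)

zoom_bins <- layer_data(zoomed)
clip_bins <- suppressWarnings(layer_data(clipped))

# Total number of observations that were binned in each plot.
print(sum(zoom_bins$count))
print(sum(clip_bins$count))

# Number of bins computed in each plot.
print(nrow(zoom_bins))
print(nrow(clip_bins))
```

The zoomed plot bins every diamond, while the clipped plot bins only those that fall inside the x limits.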
7.4.1 Exercises
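1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
In a histogram, missing values are removed (with a warning), while in a bar chart of a categorical variable, missing values are plotted as a separate NA bar. The difference arises because a histogram needs a numeric x value to assign each observation to a bin, so an NA has no place on the axis, whereas a bar chart can treat NA as one more category.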
# missing values in a histogram
data.frame(value = c(NA, NA, rnorm(1000, 0, 1))) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = value), bins = 30)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
# missing values in a bar chart
ggplot(data = data.frame(type = c('A', 'A', 'B', NA))) +
  geom_bar(mapping = aes(x = type))
2. What does na.rm = TRUE do in mean() and sum()?
When there are missing values in the vector, mean() and sum() will return NA. By including na.rm = TRUE, mean() and sum() will return the average and sum based on the non-missing values in the vector.
mean(c(10,20,87,NA,98), na.rm = TRUE)
## [1] 53.75
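For comparison, the short sketch below runs both functions on the same vector with and without na.rm = TRUE. The variable names are illustrative assumptions.

```r
scores <- c(10, 20, 87, NA, 98)

# Without na.rm, a single NA makes the result NA.
mean_with_na <- mean(scores)
sum_with_na <- sum(scores)

# With na.rm = TRUE, the NA is dropped before computing.
mean_without_na <- mean(scores, na.rm = TRUE)
sum_without_na <- sum(scores, na.rm = TRUE)

print(c(mean_with_na, sum_with_na))
print(c(mean_without_na, sum_without_na))
```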