MASTER_DATA_SCIENCE

DATA TRANSFORMATION WEEK 4:

7.3 VARIATION

7.3.4 Exercises

1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

library(tidyverse)

## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

#distribution of x:
ggplot(data = diamonds, mapping = aes(x = x)) +
  geom_density() + 
  geom_rug() +
  labs(title = 'Distribution of x(length)')

#distribution of y:
ggplot(data = diamonds, mapping = aes(x = y)) +
  geom_density() + 
  geom_rug() +
  labs(title = 'Distribution of y(width)')

#distribution of z:
ggplot(data = diamonds, mapping = aes(x = z)) +
  geom_density() + 
  geom_rug() +
  labs(title = 'Distribution of z(depth)')

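All three distributions are right-skewed: small diamonds are far more common than large ones. The rug marks also show a few implausible values, such as dimensions of 0 and isolated y and z values far above the rest, which are probably data-entry errors.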
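
A numeric summary can back up what the density plots suggest. The following is a minimal sketch, assuming the diamonds data that ships with ggplot2; the names dims_summary and zero_dims are only illustrative.

```r
library(tidyverse)

# Numeric summary (min, quartiles, mean, max) of the three dimensions
dims_summary <- diamonds %>%
  select(x, y, z) %>%
  summary()
dims_summary

# Rows where at least one dimension is recorded as 0,
# which is not physically possible for a real diamond
zero_dims <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0)
nrow(zero_dims)
```

Comparing the maximum of each variable with its third quartile is a quick way to see how far the largest values sit from the bulk of the data.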

2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 10)

We observe that most diamonds are priced below $10,000, and very few cost $15,000 or more.
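
The hint asks for a wide range of binwidths, so it may help to draw several histograms in one go. This is only a sketch: the three binwidths are arbitrary choices, and price_plots is an illustrative name.

```r
library(tidyverse)

# Binwidths to compare; these particular values are arbitrary
binwidths <- c(10, 100, 1000)

# Build one histogram of price for each binwidth
price_plots <- map(binwidths, function(bw) {
  ggplot(data = diamonds) +
    geom_histogram(mapping = aes(x = price), binwidth = bw) +
    labs(title = paste("binwidth =", bw))
})

# Print the plots one after another
walk(price_plots, print)
```

Narrow bins show fine detail such as gaps and spikes, while wide bins show the overall shape, so looking at both gives a fuller picture.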

3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>% filter(between(carat, .96, 1.05)) %>%
  group_by(carat) %>% summarize(count = n())

## # A tibble: 10 x 2
##    carat count
##    <dbl> <int>
##  1  0.96   103
##  2  0.97    59
##  3  0.98    31
##  4  0.99    23
##  5  1     1558
##  6  1.01  2242
##  7  1.02   883
##  8  1.03   523
##  9  1.04   475
## 10  1.05   361

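There are only 23 diamonds of 0.99 carat but 1558 of 1 carat. The difference may arise from a tendency to round weights up to the next whole number, since a diamond sold as 1 carat is probably more attractive to buyers than one sold as 0.99 carat.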
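
To answer the question directly, the two carat values can be counted on their own. A minimal sketch, assuming the same diamonds data as above; carat_counts is an illustrative name.

```r
library(tidyverse)

# Keep only the diamonds between 0.99 and 1 carat (inclusive).
# Carat is recorded to two decimals, so only 0.99 and 1 remain.
carat_counts <- diamonds %>%
  filter(between(carat, 0.99, 1)) %>%
  count(carat)
carat_counts

# How many 1 carat diamonds there are for each 0.99 carat diamond
carat_counts$n[2] / carat_counts$n[1]
```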

4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

#using coord_cartesian()
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 20) +
  coord_cartesian(xlim = c(0,4000), ylim = c(0,500))

#using xlim() and ylim()
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 20) +
  xlim(c(0,4000)) +
  ylim(c(0,500))

## Warning: Removed 19379 rows containing non-finite values (stat_bin).

## Warning: Removed 15 rows containing missing values (geom_bar).

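coord_cartesian() only zooms the view: the bins are computed from all the data, and anything beyond the limits is simply hidden. xlim() and ylim() instead discard the values outside the limits before the bins are computed, which is why they produce the warnings about removed rows and bars.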
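
The effect of xlim() can be imitated by hand, which may make the difference easier to see. This is a rough sketch, not an exact reproduction of what ggplot2 does internally; n_outside is an illustrative name.

```r
library(tidyverse)

# Rows whose price lies outside the x limits c(0, 4000)
n_outside <- diamonds %>%
  filter(price < 0 | price > 4000) %>%
  nrow()
n_outside

# xlim() behaves roughly like filtering the data first and then binning
diamonds %>%
  filter(between(price, 0, 4000)) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = price), binwidth = 20)
```

If the data are the same, n_outside should correspond to the number of rows reported as removed in the first warning. With coord_cartesian() these rows stay in the data and are only hidden from view.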

7.4.1 Exercises

1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

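In a histogram, the missing values are removed with a warning, while in a bar chart the missing values are plotted as a separate NA category. The difference arises because the x variable of a histogram is continuous, so an NA has no position on the numeric scale, whereas the x variable of the bar chart below is categorical, so NA can be counted as just another category.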

#missing value in a histogram
data.frame(value = c(NA, NA, rnorm(1000,0,1))) %>% ggplot() +
  geom_histogram(mapping = aes(x = value), bins = 30)

## Warning: Removed 2 rows containing non-finite values (stat_bin).

#missing value in a bargraph
ggplot(data = data.frame(type = c('A','A','B',NA))) + 
  geom_bar(mapping = aes(x = type))
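
The bar heights can also be checked without plotting. A minimal sketch using the same small data frame as above; bar_data and type_counts are illustrative names.

```r
library(tidyverse)

# Same toy data as in the bar chart: two A, one B, one missing value
bar_data <- data.frame(type = c('A', 'A', 'B', NA))

# count() keeps NA as its own group, just as geom_bar() gives it its own bar
type_counts <- bar_data %>% count(type)
type_counts
```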

2. What does na.rm = TRUE do in mean() and sum()?

When there are missing values in the vector, mean() and sum() return NA. With na.rm = TRUE, the missing values are dropped first, so mean() and sum() return the average and sum of the non-missing values in the vector.

mean(c(10,20,87,NA,98), na.rm = TRUE)

## [1] 53.75
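
For comparison, here are all four cases side by side on the same vector. The variable names are only illustrative.

```r
values <- c(10, 20, 87, NA, 98)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(values)
sum_with_na <- sum(values)

# With na.rm = TRUE, the NA is dropped before computing
mean_no_na <- mean(values, na.rm = TRUE)  # (10 + 20 + 87 + 98) / 4 = 53.75
sum_no_na <- sum(values, na.rm = TRUE)    # 10 + 20 + 87 + 98 = 215

mean_with_na
sum_with_na
mean_no_na
sum_no_na
```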


SHAMIM RASHID

12/8/2019