Exercises: R for Data Science, Chapter 10

Author

P K Parida

Exercise-10

library(tidyverse)

glimpse(diamonds)

Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

summary(diamonds)

     carat               cut        color        clarity          depth      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
                                    J: 2808   (Other): 2531                  
     table           price             x                y         
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
                                                                  
       z         
 Min.   : 0.000  
 1st Qu.: 2.910  
 Median : 3.530  
 Mean   : 3.539  
 3rd Qu.: 4.040  
 Max.   :31.800

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

1. Explore the distribution of each of the `x`, `y`, and `z` variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

ggplot(diamonds, mapping= aes(x = x)) +
  geom_density() + geom_rug()

`geom_rug()` adds a one-dimensional strip of tick marks along the x axis, one per observation.

ggplot(diamonds, mapping = aes(x = y)) +
  geom_density() + geom_rug()

ggplot(diamonds, mapping = aes(x = z)) +
  geom_density() + geom_rug()

ggplot(diamonds) +
  geom_point(aes(x = x, y = y)) +
  geom_point(aes(x = x, y = z), color = "blue")

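`x` and `y` have nearly the same distribution (their quartiles in the summary are almost identical), so they are most likely length and width. `z` is consistently smaller, so it is most likely depth.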
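The `summary()` output above also shows minimum values of 0 for all three dimensions and maximums of 58.9 for `y` and 31.8 for `z`, which are implausible for a diamond. A minimal sketch for listing those rows, assuming an upper cut-off of 20 mm (the threshold and the name `suspect` are illustrative choices, not part of the exercise):

```r
library(tidyverse)

# Rows whose dimensions look like data-entry errors.
# The 20 mm upper bound is an assumed, illustrative threshold.
suspect <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 20) %>%
  select(carat, x, y, z) %>%
  arrange(desc(y))

suspect
```

Looking at `carat` next to the dimensions helps judge whether a value is a real measurement or a recording error.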

2. Explore the distribution of `price`. Do you discover anything unusual or surprising? (Hint: Carefully think about the `binwidth` and make sure you try a wide range of values.)

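# Drop rows with implausible dimensions (near-zero or extremely large values).
# Note that this overwrites `diamonds` for the rest of the session.
diamonds <- diamonds %>% filter(2 < y & y < 20 & 2 < x & 2 < z & z < 20)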
ggplot(diamonds) + 
  geom_freqpoly(aes(x = price), binwidth = 10) +
  xlim(c(1000, 2000))

ggplot(diamonds) + 
  geom_freqpoly(aes(x = price), binwidth = 20)

Unusually, there are no diamonds priced at around $1,500. There is also a bump in the number of diamonds priced at around $4,500.
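To check the gap numerically rather than by eye, one option is to count diamonds in narrow price bands. A sketch, assuming $50 bands between $1,000 and $2,000 (the band width and the name `price_bands` are arbitrary choices for this example):

```r
library(tidyverse)

price_bands <- diamonds %>%
  filter(price >= 1000, price < 2000) %>%
  mutate(band = floor(price / 50) * 50) %>%  # lower edge of each $50 band
  count(band)

price_bands
```

Bands with unusually low counts, or bands missing from the table altogether, mark where the gap sits.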

3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>% filter(carat == 0.99) %>% count()

# A tibble: 1 × 1
      n
  <int>
1    23

diamonds %>% filter(carat == 1) %>% count()

# A tibble: 1 × 1
      n
  <int>
1  1556

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.03))
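Counting every carat value near 1 in a single table gives the same comparison without two separate filters. A sketch, with the 0.95 to 1.05 window chosen only for illustration:

```r
library(tidyverse)

near_one <- diamonds %>%
  filter(carat >= 0.95, carat <= 1.05) %>%
  count(carat)

near_one
```

The sharp jump at exactly 1 would be consistent with weights being rounded up, or with stones being cut to reach a round number, though the data alone cannot confirm either explanation.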

4. Compare and contrast `coord_cartesian()` vs `xlim()` or `ylim()` when zooming in on a histogram. What happens if you leave `binwidth` unset? What happens if you try and zoom so only half a bar shows?

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  xlim(c(0.97, 1.035))

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  coord_cartesian(xlim = c(0.97, 1.035))

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  coord_cartesian(ylim = c(0, 2000))

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.035))

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  coord_cartesian(xlim = c(0.97, 1.035))
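The difference between the two approaches can be checked on the computed bins: `xlim()` discards observations outside the limits before binning, while `coord_cartesian()` bins everything and only zooms the view. A sketch using `ggplot_build()` to read the bin counts (the object names are introduced for this example, and the warnings about removed rows that the `xlim()` version produces are suppressed):

```r
library(tidyverse)

base <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

p_xlim <- base + xlim(c(0.97, 1.035))
p_zoom <- base + coord_cartesian(xlim = c(0.97, 1.035))

# Total number of observations that ended up in a bin.
n_xlim <- suppressWarnings(sum(ggplot_build(p_xlim)$data[[1]]$count))
n_zoom <- sum(ggplot_build(p_zoom)$data[[1]]$count)

c(xlim = n_xlim, coord_cartesian = n_zoom, rows = nrow(diamonds))
```

When `binwidth` is left unset, this also explains why the two plots look different: with `xlim()` the default 30 bins are spread over the restricted range, while with `coord_cartesian()` they are spread over the full range of `carat`.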

10.4.1 Exercises

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
In a bar chart, `NA` is treated as just another category and gets its own bar. In a histogram, `NA` values are dropped (with a warning) because they have no position on a continuous x axis.

set.seed(0)
df <- tibble(norm = rnorm(100)) %>% mutate(inrange = ifelse(norm > 2, NA, norm))
ggplot(df) +
  geom_histogram(aes(x = inrange))

# replace() keeps `cut` as an ordered factor; ifelse() would return its integer codes
df <- diamonds %>% mutate(cut = replace(cut, y > 7, NA))
ggplot(df) + geom_bar(aes(x = cut))
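One thing to watch when setting factor values to `NA`: base `ifelse()` drops factor attributes and returns the underlying integer codes, whereas `replace()` keeps the levels. A small illustration on a toy factor (the vectors `f` and `to_na` are made up for this example):

```r
f <- factor(c("Fair", "Good", "Ideal"),
            levels = c("Fair", "Good", "Ideal"), ordered = TRUE)
to_na <- c(FALSE, TRUE, FALSE)

codes  <- ifelse(to_na, NA, f)   # integer codes; the labels are lost
labels <- replace(f, to_na, NA)  # still an ordered factor with its levels

str(codes)
str(labels)
```

With integer codes, the bars of a bar chart end up labelled by number instead of by cut name.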

10.5.1.1 Exercises

1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

library(nycflights13)
flights %>% 
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>% 
  ggplot() +
  geom_boxplot(aes(x = cancelled, y = dep_time))

flights %>% 
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>% 
  filter(cancelled) %>% 
  select(dep_time)

# A tibble: 8,713 × 1
   dep_time
      <int>
 1     2016
 2       NA
 3       NA
 4       NA
 5       NA
 6     2041
 7     2145
 8       NA
 9       NA
10       NA
# ℹ 8,703 more rows

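This question is puzzling at first: how can cancelled flights have departure times? The rows above that do have a `dep_time` are flights that departed but have no recorded arrival time; the `is.na(arr_time)` condition flags them as cancelled too. Flights that never departed have no `dep_time` at all, so the boxplot above has almost nothing to show for the cancelled group. The comparison is more meaningful on the scheduled departure time, `sched_dep_time`.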
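A sketch of one possible improvement, assuming a flight counts as cancelled when `dep_time` is missing and plotting the scheduled departure time. `sched_dep_time` is stored as an HHMM integer, so it is converted to decimal hours here; `flights_cancelled` and `sched_hour` are names introduced for this example:

```r
library(tidyverse)
library(nycflights13)

flights_cancelled <- flights %>%
  mutate(
    cancelled  = is.na(dep_time),
    # sched_dep_time is an integer such as 1530 for 15:30
    sched_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  )

ggplot(flights_cancelled) +
  geom_boxplot(aes(x = cancelled, y = sched_hour))
```

Converting to decimal hours avoids the artificial gaps that HHMM values create between, say, 1559 and 1600.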

2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

ggplot(diamonds) +
  geom_point(aes(x = carat, y = price), color = "blue", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = depth, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = table, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = x, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = z, y = price), color = "red", alpha = 0.5)

Volume and weight (`carat`) are the most important variables for predicting price. Since volume is highly correlated with weight, the two can be treated as a single variable.
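The claim about volume and weight can be checked with a correlation. A sketch that approximates volume as `x * y * z` (a bounding box, which is an assumption rather than a true volume) after dropping rows with implausible dimensions; `dims` and the two `cor_*` names are introduced for this example:

```r
library(tidyverse)

dims <- diamonds %>%
  filter(x > 0, y > 0, z > 0, y < 20, z < 20) %>%  # drop implausible dimensions
  mutate(volume = x * y * z)                        # bounding-box approximation

cor_volume_carat <- cor(dims$volume, dims$carat)
cor_carat_price  <- cor(dims$carat, dims$price)

c(volume_vs_carat = cor_volume_carat, carat_vs_price = cor_carat_price)
```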

ggplot(diamonds) +
  geom_boxplot(aes(x = cut, y = carat))

Better-cut diamonds tend to have lower carat, which lowers their price. If we ignore carat, it therefore appears that better cuts are cheaper.
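The same relationship can be summarised numerically. A sketch computing the median carat and median price per cut (`by_cut` is a name introduced for this example):

```r
library(tidyverse)

by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    median_carat = median(carat),
    median_price = median(price),
    n = n()
  )

by_cut
```

Reading the two median columns side by side shows whether the cuts with smaller stones are also the ones with lower prices.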

3. Install the `ggstance` package, and create a horizontal boxplot. How does this compare to using `coord_flip()`?

library(ggstance)

ggplot(diamonds) + geom_boxplot(aes(x = cut, y = carat)) + coord_flip()

ggplot(diamonds) + geom_boxploth(aes(x = carat, y = cut))
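In ggplot2 3.3.0 and later, `geom_boxplot()` can also draw horizontal boxplots directly when the continuous variable is mapped to `x`, so neither `coord_flip()` nor ggstance is strictly needed. A sketch, assuming a recent ggplot2 (`p_horizontal` is a name introduced for this example):

```r
library(tidyverse)

# Continuous variable on x, discrete variable on y: the orientation is inferred.
p_horizontal <- ggplot(diamonds) +
  geom_boxplot(aes(x = carat, y = cut))

p_horizontal
```

As with `geom_boxploth()`, the aesthetics are written in the orientation in which they are drawn, whereas with `coord_flip()` they are written for the vertical plot and then swapped.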

4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the `lvplot` package, and try using `geom_lv()` to display the distribution of `price` vs `cut`. What do you learn? How do you interpret the plots?

library(lvplot)
ggplot(diamonds) + geom_lv(aes(x = cut, y = price))

While the boxplot only shows a few quantiles and outliers, the letter-value plot shows many quantiles.

5. Compare and contrast `geom_violin()` with a facetted `geom_histogram()`, or a coloured `geom_freqpoly()`. What are the pros and cons of each method?

ggplot(diamonds) +
  geom_histogram(aes(x = price)) +
  facet_wrap(~cut)

ggplot(diamonds) +
  geom_freqpoly(aes(x = price, colour = cut))

ggplot(diamonds) +
  geom_violin(aes(x = cut, y = price))

ggplot(diamonds) +
  geom_lv(aes(x = cut, y = price))
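Because the cuts have very different counts, overlaid frequency polygons of raw counts are dominated by the largest group. One option is to plot density instead, so that each polygon has an area of one. A sketch, with `binwidth = 500` chosen arbitrarily and `after_stat()` assuming ggplot2 3.3.0 or later (`p_density` is a name introduced for this example):

```r
library(tidyverse)

p_density <- ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_freqpoly(aes(colour = cut), binwidth = 500)

p_density
```

This keeps all cuts on one panel for direct comparison of shape, at the cost of hiding how many diamonds each cut contains.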
