Exercises: R for Data Science, Chapter 10

Author

P K Parida

Exercise-10

library(tidyverse)

glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
summary(diamonds)
     carat               cut        color        clarity          depth      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
                                    J: 2808   (Other): 2531                  
     table           price             x                y         
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
                                                                  
       z         
 Min.   : 0.000  
 1st Qu.: 2.910  
 Median : 3.530  
 Mean   : 3.539  
 3rd Qu.: 4.040  
 Max.   :31.800  
                 
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

ggplot(diamonds, mapping = aes(x = x)) +
  geom_density() + geom_rug()

geom_rug() draws one tick per observation along the axis, giving a one-dimensional view of the distribution below the density curve.

ggplot(diamonds, mapping = aes(x = y)) +
  geom_density() + geom_rug()

ggplot(diamonds, mapping = aes(x = z)) +
  geom_density() + geom_rug()

ggplot(diamonds) +
  geom_point(aes(x = x, y = y)) +
  geom_point(aes(x = x, y = z), color = "blue") 

x and y are nearly equal to each other and both larger than z, so x and y are most likely length and width, and z is depth.
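The guess that z is the depth can be checked against the depth column. This is a minimal sketch, assuming the definition on the ggplot2 help page for diamonds, where depth is the total depth percentage 2 * z / (x + y). The names depth_check, depth_from_xyz and abs_diff are introduced here for illustration only.

```r
library(ggplot2)
library(dplyr)

# Assumption: depth = 100 * 2 * z / (x + y), as described in ?diamonds.
depth_check <- ggplot2::diamonds %>%
  filter(x > 0, y > 0, z > 0) %>%                 # drop impossible zero dimensions
  mutate(depth_from_xyz = 100 * 2 * z / (x + y),
         abs_diff = abs(depth_from_xyz - depth))

# If z really is the depth, the recomputed value should sit close to
# the recorded depth for most diamonds.
depth_check %>%
  summarise(median_abs_diff = median(abs_diff),
            share_within_1 = mean(abs_diff < 1))
```

A small median difference supports reading z as depth. Rows with a large difference are candidates for data-entry errors.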

2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

diamonds <- diamonds %>% filter(2 < y & y < 20 & 2 < x & 2 < z & z < 20)
ggplot(diamonds) + 
  geom_freqpoly(aes(x = price), binwidth = 10) +
  xlim(c(1000, 2000))

ggplot(diamonds) + 
  geom_freqpoly(aes(x = price), binwidth = 20) 

Unusually, there are no diamonds priced at around 1,500. There is also a surge in the number of diamonds priced at around 4,500.
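The gap can also be inspected numerically rather than read off the plot. A minimal sketch, assuming 50-dollar bins are fine enough to expose it; price_bins and price_bin are illustrative names. Bins that contain no diamonds simply do not appear in the result.

```r
library(ggplot2)
library(dplyr)

price_bins <- diamonds %>%
  filter(price >= 1000, price < 2000) %>%
  mutate(price_bin = (price %/% 50) * 50) %>%   # lower edge of each 50-dollar bin
  count(price_bin)

# A missing or very small bin near 1500 corresponds to the gap in the plot.
print(price_bins, n = Inf)
```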

3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>% filter(carat == 0.99) %>% count()
# A tibble: 1 × 1
      n
  <int>
1    23
diamonds %>% filter(carat == 1) %>% count()
# A tibble: 1 × 1
      n
  <int>
1  1556
ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.03))
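Counting every carat value close to 1 puts the two numbers in context. This is a sketch only; near_one_carat is an illustrative name. One plausible reading, which is an assumption and not something the data can confirm, is that weights just below 1 carat tend to be cut or reported so that they reach the 1-carat mark.

```r
library(ggplot2)
library(dplyr)

near_one_carat <- diamonds %>%
  filter(carat >= 0.95, carat <= 1.05) %>%
  count(carat)

# Compare the counts just below 1 carat with the count at exactly 1 carat.
near_one_carat
```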

4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  xlim(c(0.97, 1.035))

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  coord_cartesian(xlim = c(0.97, 1.035))

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  coord_cartesian(ylim = c(0, 2000))

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.035))

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  coord_cartesian(xlim = c(0.97, 1.035))
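The difference between the two approaches shows up in how many observations reach the binning step. A minimal sketch using layer_data() from ggplot2 to read back the computed bins; base_hist, p_xlim, p_zoom, n_xlim and n_zoom are illustrative names.

```r
library(ggplot2)

base_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

# xlim() sets scale limits: values outside them become NA before binning.
p_xlim <- base_hist + xlim(0.97, 1.035)

# coord_cartesian() only changes the visible window; all data are still binned.
p_zoom <- base_hist + coord_cartesian(xlim = c(0.97, 1.035))

# layer_data() returns the data computed for a layer (here, the bin counts).
n_xlim <- sum(suppressWarnings(layer_data(p_xlim))$count)
n_zoom <- sum(layer_data(p_zoom)$count)

c(binned_with_xlim = n_xlim, binned_with_coord_cartesian = n_zoom)
```

With xlim() only the diamonds inside the limits are counted, so when binwidth is unset the default 30 bins are spread over the zoomed range. With coord_cartesian() the bins are computed on the full data and the plot is merely cropped.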

10.4.1 Exercises

1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
In a bar chart, NA is treated as just another category and gets its own bar. In a histogram, NA values are dropped (with a warning) because they cannot be placed on a continuous, ordered x axis.

set.seed(0)
df <- tibble(norm = rnorm(100)) %>% mutate(inrange = ifelse(norm > 2, NA, norm))
ggplot(df) +
  geom_histogram(aes(x = inrange))

# replace() keeps the factor labels; ifelse() on a factor returns its integer codes
df <- diamonds %>% mutate(cut = replace(cut, y > 7, NA))
ggplot(df) + geom_bar(aes(x = cut))

10.5.1.1 Exercises

1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

library(nycflights13)
flights %>% 
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>% 
  ggplot() +
  geom_boxplot(aes(x = cancelled, y = dep_time))

flights %>% 
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>% 
  filter(cancelled) %>% 
  select(dep_time)
# A tibble: 8,713 × 1
   dep_time
      <int>
 1     2016
 2       NA
 3       NA
 4       NA
 5       NA
 6     2041
 7     2145
 8       NA
 9       NA
10       NA
# ℹ 8,703 more rows

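Puzzled by this question: how can cancelled flights have departure times? The rows above that do have a dep_time are flights with a missing arr_time, which the definition of cancelled used here also flags.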
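One way around the puzzle is sketched below, under two assumptions: that the exercise intends the scheduled departure time (sched_dep_time), which is recorded for every flight, and that a flight counts as cancelled when dep_time is missing. The names flights_sched and sched_hour are illustrative.

```r
library(nycflights13)
library(ggplot2)
library(dplyr)

flights_sched <- flights %>%
  mutate(
    cancelled = is.na(dep_time),   # assumption: no departure time means cancelled
    # sched_dep_time is stored as HHMM; convert it to decimal hours
    sched_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  )

# Both groups now have values to plot, so the boxplots are comparable.
ggplot(flights_sched) +
  geom_boxplot(aes(x = cancelled, y = sched_hour))
```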

2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

ggplot(diamonds) +
  geom_point(aes(x = carat, y = price), color = "blue", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = depth, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = table, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = x, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = z, y = price), color = "red", alpha = 0.5)

Volume (x, y, z) and weight (carat) are the variables most important for predicting price. Since volume is highly correlated with weight, they can be treated as a single variable.

ggplot(diamonds) +
  geom_boxplot(aes(x = cut, y = carat))

Better cuts tend to have lower carat, and lower carat means a lower price. So if we ignore carat, better-cut diamonds appear to be cheaper.
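The confounding between cut and carat can be looked at with a couple of summaries. This is a rough sketch, not a model; the 0.5-carat band width and the names price_by_band and carat_band are arbitrary choices made here for illustration.

```r
library(ggplot2)
library(dplyr)

# Median carat and median price per cut, ignoring every other variable.
diamonds %>%
  group_by(cut) %>%
  summarise(median_carat = median(carat), median_price = median(price))

# The same comparison within bands of similar weight.
price_by_band <- diamonds %>%
  mutate(carat_band = cut_width(carat, width = 0.5, boundary = 0)) %>%
  group_by(carat_band, cut) %>%
  summarise(median_price = median(price), n = n(), .groups = "drop")

price_by_band
```

Comparing cuts inside one carat band removes most of the effect of weight, so the price ordering there is a fairer view of what cut contributes.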

3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

library(ggstance)
ggplot(diamonds) + geom_boxplot(aes(x = cut, y = carat)) + coord_flip()

ggplot(diamonds) + geom_boxploth(aes(x = carat, y = cut))
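For comparison, recent ggplot2 releases can draw the horizontal boxplot without ggstance or coord_flip(). A minimal sketch, assuming ggplot2 3.3.0 or later, where geom_boxplot() infers the orientation from which aesthetic is discrete; p_native is an illustrative name.

```r
library(ggplot2)

# The discrete variable on y makes geom_boxplot() draw horizontal boxes.
p_native <- ggplot(diamonds) +
  geom_boxplot(aes(x = carat, y = cut))

p_native
```

As with geom_boxploth(), the aesthetics are written the way the plot is read, whereas coord_flip() keeps the vertical mapping and flips the axes afterwards.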

4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

library(lvplot)
ggplot(diamonds) + geom_lv(aes(x = cut, y = price))

While the boxplot only shows a few quantiles and outliers, the letter-value plot shows many quantiles.
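The idea of showing many quantiles can be approximated by hand. The sketch below uses plain quantiles at tail areas 1/2, 1/4, 1/8 and so on for one cut. It only approximates letter values and is not how lvplot computes them; ideal_price, tail_area and letter_values are illustrative names.

```r
library(ggplot2)
library(dplyr)

ideal_price <- diamonds$price[diamonds$cut == "Ideal"]

# Tail areas 1/2, 1/4, 1/8, ...: each step looks further into the tails.
tail_area <- 1 / 2^(1:6)

letter_values <- tibble(
  tail_area = tail_area,
  lower = unname(quantile(ideal_price, tail_area)),
  upper = unname(quantile(ideal_price, 1 - tail_area))
)

letter_values
```

Each row corresponds roughly to one box in the letter-value plot: the first row is the median, the second the quartiles that a boxplot also shows, and later rows describe the tails that a boxplot would draw as individual outliers.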

5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

ggplot(diamonds) +
  geom_histogram(aes(x = price)) +
  facet_wrap(~cut)

ggplot(diamonds) +
  geom_freqpoly(aes(x = price)) +
  facet_wrap(~cut)

ggplot(diamonds) +
  geom_violin(aes(x = cut, y = price))

ggplot(diamonds) +
  geom_lv(aes(x = cut, y = price))
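The exercise also mentions a coloured geom_freqpoly(), which overlays the cuts in one panel instead of facetting them. A minimal sketch; the binwidth of 500 and the use of density on the y axis are choices made here, and p_freq is an illustrative name.

```r
library(ggplot2)

# after_stat(density) rescales each line to unit area, so cuts with very
# different counts can be compared by shape.
p_freq <- ggplot(diamonds) +
  geom_freqpoly(
    aes(x = price, y = after_stat(density), colour = cut),
    binwidth = 500
  )

p_freq
```

Overlaid lines make it easy to compare cuts at a given price but become hard to read where they cross. Facetted histograms keep each cut separate at the cost of comparing across panels, and violins are compact but hide the raw counts.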
