10.3.3 Exercises

1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
4. Compare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

10.4.1 Exercises

1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?
2. What does na.rm = TRUE do in mean() and sum()?
3. Recreate the frequency plot of scheduled_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
smaller <- diamonds |>
  filter(carat < 3)
ggplot(smaller, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)
ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
unusual <- diamonds |>
  filter(y < 3 | y > 20) |>
  select(price, x, y, z) |>
  arrange(y)
unusual
## # A tibble: 9 × 4
## price x y z
## <int> <dbl> <dbl> <dbl>
## 1 5139 0 0 0
## 2 6381 0 0 0
## 3 12800 0 0 0
## 4 15686 0 0 0
## 5 18034 0 0 0
## 6 2130 0 0 0
## 7 2130 0 0 0
## 8 2075 5.15 31.8 5.12
## 9 12210 8.09 58.9 8.06
hist(diamonds$x)
hist(diamonds$y)
hist(diamonds$z)
We learn that diamonds with a greater depth relative to their width may reflect light differently from shallower ones, which affects their visual appeal. To decide which dimension is which, consider the shape and proportions of the diamond when viewed from the top. Length is typically the longest dimension in that view: for round diamonds the length and width are similar, while for elongated shapes such as ovals the length is measurably longer than the width. Width is measured perpendicular (at 90 degrees) to the length. Depth is the height of the diamond from top to bottom. It is a key factor in cut quality, because the depth percentage affects how light travels within the diamond.
hist(diamonds$price)
We noticed that a number of diamonds have zero measurements for x, y, and z, which is unusual and could be due to data entry errors. Two diamonds have very large y dimensions (31.8 and 58.9), which could indicate outliers or errors. When the binwidth is large, potential errors or outliers can stay hidden, whereas a smaller binwidth makes them visible.
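One way to act on the binwidth hint is to draw the same price histogram at several widths. This is a minimal sketch; the widths 1000, 100, and 25 are arbitrary values picked for illustration, not values given in the exercise.

```r
library(ggplot2)

# Build one price histogram per candidate binwidth (values chosen arbitrarily).
binwidths <- c(1000, 100, 25)
price_plots <- lapply(binwidths, function(bw) {
  ggplot(diamonds, aes(x = price)) +
    geom_histogram(binwidth = bw) +
    labs(title = paste("binwidth =", bw))
})

# Print each plot in turn; narrower bins expose finer structure.
for (p in price_plots) print(p)
```

Comparing the plots side by side makes it easier to judge which features persist across binwidths and which appear only at one setting.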
# Count diamonds recorded as 0.99 carat
count_099 <- diamonds %>%
  filter(carat >= 0.99, carat < 1) %>%
  summarise(count = n()) %>%
  pull(count)
message(count_099, " diamonds are 0.99 carats")
## 23 diamonds are 0.99 carats
# Count diamonds recorded as 1 carat
count_1 <- diamonds %>%
  filter(carat >= 1, carat < 1.01) %>%
  summarise(count = n()) %>%
  pull(count)
message(count_1, " diamonds are 1 carat")
## 1558 diamonds are 1 carat
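Looking at the neighbouring carat values can put these two counts in context. The sketch below counts diamonds at each recorded carat value in a window around 1 carat; the 0.95 to 1.05 window and the name near_one are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Count diamonds at each recorded carat value close to 1 carat.
# The 0.95-1.05 window is an assumption made for this example.
near_one <- diamonds |>
  filter(between(carat, 0.95, 1.05)) |>
  count(carat)

near_one
```

A table like this shows whether the jump between 0.99 and 1 carat is an isolated spike or part of a broader pattern.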
If binwidth is left unset in geom_histogram(), ggplot2 falls back to its default of 30 bins and prints a message suggesting that you pick a better value with binwidth. The resulting bin width depends only on the range of the data, so important features of the data may be poorly represented in the histogram.
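The default can be compared with an explicit choice by inspecting the computed bins with layer_data(). This is a small sketch; the binwidth of 0.1 is an arbitrary value used only for illustration.

```r
library(ggplot2)

# No binwidth: ggplot2 uses its default number of bins and prints a message.
p_default <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram()

# Explicit binwidth: each bar covers 0.1 carat (an arbitrary choice here).
p_explicit <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# layer_data() returns one row per computed bin, so the bin counts can be compared.
nrow(layer_data(p_default))
nrow(layer_data(p_explicit))
```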
When you zoom with coord_cartesian() so that only half a bar is in view, the bar is simply cropped at the edge of the plot: coord_cartesian() does not change how the data are binned or which data points are included, so the bar heights stay the same. xlim() and ylim() instead set the scale limits, so values outside them are dropped (with a warning) before the histogram is computed. A bar that extends beyond the limits is not drawn at all, and the remaining bars are built only from the data inside the limits.
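The difference between the two approaches can be checked by summing the bin counts that ggplot2 computes in each case. This is a minimal sketch; the limits of 0 to 1 carat and the binwidth of 0.1 are arbitrary choices for illustration.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# Zoom only: every diamond is still binned, the view is just cropped.
zoomed <- base + coord_cartesian(xlim = c(0, 1))

# Scale limits: carat values outside [0, 1] become NA and are dropped
# (with a warning) before the bins are computed.
limited <- base + xlim(0, 1)

sum(layer_data(zoomed)$count)                      # every row of diamonds
suppressWarnings(sum(layer_data(limited)$count))   # only rows inside the limits
```

The first total should match nrow(diamonds), while the second should be smaller because the out-of-range diamonds never reach the binning step.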
##10.4.1 Exercises
#What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?
In a histogram, missing values are removed, and ggplot2 warns about the removed rows. They cannot fall into any bin, because a histogram needs a numeric value to place each data point within a range on the x-axis.
For a bar chart, missing values can be treated as a category. Here we can have a specific bar to represent the count of missing values because a bar chart’s x-axis is not inherently numerical.
The difference is that histograms are meant for continuous data, so it doesn’t make sense to have a bin for missing values.
Bar charts are for categorical data and missing values can be treated as a distinct category. Continuous variables can take on any value within a range, while categorical variables represent distinct groups or categories.
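The contrast can be seen on a tiny made-up data set. In this sketch, toy and its columns size and shape are hypothetical names invented for the example; layer_data() is used to inspect what each plot actually computes.

```r
library(ggplot2)

# Toy data (hypothetical): one NA in the numeric column, one in the categorical.
toy <- data.frame(
  size  = c(1.2, 2.5, NA, 3.1),
  shape = c("round", "oval", "round", NA)
)

# Histogram: the NA in `size` is removed with a warning before binning.
hist_plot <- ggplot(toy, aes(x = size)) +
  geom_histogram(binwidth = 1)

# Bar chart: the NA in `shape` is kept and gets its own bar.
bar_plot <- ggplot(toy, aes(x = shape)) +
  geom_bar()

suppressWarnings(sum(layer_data(hist_plot)$count))  # counts only non-missing sizes
nrow(layer_data(bar_plot))                          # one row per bar, NA included
```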
#What does na.rm = TRUE do in mean() and sum()?
The na.rm argument appears in many functions to control how missing values, represented by NA, are handled. The name na.rm stands for “NA remove”. Setting na.rm = TRUE instructs the function to remove the NA values before doing the calculation.
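A short illustration with a made-up vector (the name values is arbitrary):

```r
values <- c(2, 4, NA, 6)

mean(values)                # NA: one missing value makes the result unknown
mean(values, na.rm = TRUE)  # 4: the mean of 2, 4, and 6
sum(values)                 # NA for the same reason
sum(values, na.rm = TRUE)   # 12
```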
#Recreate the frequency plot of scheduled_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights.
```{r}
library(tidyverse)

nycflights13::flights |>
  # flights has no cancelled column; treat a missing dep_time as a cancellation
  mutate(cancelled = is.na(dep_time)) |>
  # the scheduled departure time column is named sched_dep_time
  ggplot(aes(x = sched_dep_time, fill = cancelled)) +
  geom_histogram(position = "dodge") +
  # free_y only helps when panels sit in separate rows, so stack them
  facet_wrap(~cancelled, ncol = 1, scales = "free_y") +
  theme_minimal() +
  labs(x = "Scheduled Departure Time", y = "Frequency", fill = "Cancelled")
```