Diamond Workshop

Getting started:

If necessary, install the tidyverse, then load it.

#install.packages("tidyverse")
library(tidyverse)

In the console, enter View(diamonds).

View(diamonds)

In the console, type ?diamonds. This opens a help page describing diamonds. Read the help page and compare its contents with the data you see in the View pane.

?diamonds

## starting httpd help server ... done

Create a “data dictionary” in which you list each variable and its definition.

price: price in US dollars ($326–$18,823)

carat: weight of the diamond (0.2–5.01)

cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color: diamond color, from D (best) to J (worst)

clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x: length in mm (0–10.74)

y: width in mm (0–58.9)

z: depth in mm (0–31.8)

depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

table: width of top of diamond relative to widest point (43–95)

In the code chunk below, create a summary of diamonds using the summary() function.

summary(diamonds)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

How does the summary of a categorical variable differ from the summary of a quantitative variable?

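For a categorical variable (cut, color, clarity), summary() lists each level with the number of observations in it; when there are more levels than it displays, the remainder are pooled as (Other). For a quantitative variable (carat, depth, table, price, x, y, z), it reports the minimum, first quartile, median, mean, third quartile, and maximum.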
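
A minimal sketch of the difference, assuming the diamonds data that ships with ggplot2: calling summary() on one categorical column and one quantitative column shows the two kinds of output side by side. The names cut_summary and carat_summary are illustrative only.

```r
library(ggplot2)  # provides the diamonds data set

# Categorical (ordered factor): one count per level
cut_summary <- summary(diamonds$cut)
cut_summary

# Quantitative (numeric): minimum, quartiles, median, mean, maximum
carat_summary <- summary(diamonds$carat)
carat_summary
```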

In the code chunk below, create a bar chart visualization of color.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = color))

Using the dplyr function count(), produce a frequency table for color in the code chunk below.

diamonds %>%
  count(color)

## # A tibble: 7 x 2
##   color     n
##   <ord> <int>
## 1 D      6775
## 2 E      9797
## 3 F      9542
## 4 G     11292
## 5 H      8304
## 6 I      5422
## 7 J      2808
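
The counts in this table are the bar heights in the bar chart of color. As a small extension that is not part of the original exercise, the sketch below adds each color's share of the data; the column name prop and the object name color_table are illustrative.

```r
library(ggplot2)  # diamonds data set
library(dplyr)    # count(), mutate(), %>%

color_table <- diamonds %>%
  count(color) %>%
  mutate(prop = n / sum(n))  # share of all diamonds in each color grade
color_table
```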

For examining the variability of a continuous numerical variable, the first choice is frequently the histogram. A histogram resembles a bar chart, with an important difference: the bars cover ranges of a continuous variable rather than separate categories.

Plot a histogram of carat from the diamonds data set.

ggplot(diamonds) + 
  geom_histogram(aes(x = carat))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In a histogram, the range of data values is divided into bins. The number of bins depends on the width of each bin. Plot the above histogram with binwidths of 0.01, 0.05, 0.1, 0.25, and 0.5. What do you observe about the resulting histograms?

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01)

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.05)

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.1)

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.25)

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.5)

# The bars get wider as binwidth increases because the data are grouped into fewer bins, so fine detail in the distribution is lost.
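
The link between bin width and bin count can be estimated directly. This is a rough sketch that assumes the count is simply the range of carat divided by the bin width; ggplot2 may add a bin at either edge depending on how it aligns the bins, so treat the numbers as approximate.

```r
library(ggplot2)  # diamonds data set

binwidths <- c(0.01, 0.05, 0.1, 0.25, 0.5)
carat_range <- diff(range(diamonds$carat))  # maximum carat minus minimum carat

# Approximate bin count: range divided by bin width, rounded up
approx_bins <- ceiling(carat_range / binwidths)
data.frame(binwidth = binwidths, approx_bins = approx_bins)
```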

Using the ggplot2 function cut_width(), make a table of carat frequencies with binwidth = 0.75. Compare your table with the corresponding histogram.

diamonds %>%
  count(cut_width(carat, 0.75))

## # A tibble: 8 x 2
##   `cut_width(carat, 0.75)`     n
##   <fct>                    <int>
## 1 [-0.375,0.375]           12024
## 2 (0.375,1.12]             30983
## 3 (1.12,1.88]               8722
## 4 (1.88,2.62]               2147
## 5 (2.62,3.38]                 53
## 6 (3.38,4.12]                  8
## 7 (4.12,4.88]                  2
## 8 (4.88,5.62]                  1

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.75)

Plot a histogram with a binwidth of 0.1 but only for diamonds with carat < 2.

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.1) + 
  coord_cartesian(xlim = c(0, 2))
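
coord_cartesian() zooms the plot but still bins every diamond. An alternative sketch, assuming the goal is to leave the larger stones out entirely, filters the data before plotting. The name smaller is illustrative.

```r
library(ggplot2)  # diamonds data set and plotting
library(dplyr)    # filter(), %>%

# Keep only the diamonds under 2 carats before binning
smaller <- diamonds %>%
  filter(carat < 2)

ggplot(smaller) +
  geom_histogram(aes(x = carat), binwidth = 0.1)
```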

Read about geom_freqpoly() and produce overlaid histograms with binwidth = 0.1 for each color. What happens if you set x = price, y = ..density.. in the aes for geom_freqpoly()?

ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = carat, color = color), binwidth = 0.1, boundary = 0)

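ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = price, y = ..density.., color = color), binwidth = 500, boundary = 0)

# With y = ..density.. each line is rescaled so that the area under it is 1.
# The colors can then be compared by the shape of their price distributions, not by how many diamonds each has.
# Mapping y = depth instead is an error, because stat_bin() accepts only an x or a y aesthetic.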
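
What y = ..density.. in the exercise computes can be checked by hand. The sketch below uses base R's hist() rather than ggplot2's own binning, with an assumed bin width of 500 dollars, so the bin edges may not match a ggplot2 plot exactly. It shows that a density is the bin count divided by the number of observations times the bin width, which makes the total area equal to 1.

```r
library(ggplot2)  # diamonds data set

binwidth <- 500                         # assumed bin width in dollars
breaks <- seq(0, 19000, by = binwidth)  # covers the full range of price

# Bin the prices without drawing anything
h <- hist(diamonds$price, breaks = breaks, plot = FALSE)

# Density is the count divided by (number of observations * bin width)
density_by_hand <- h$counts / (sum(h$counts) * binwidth)

# The densities multiplied by the bin width add up to 1
total_area <- sum(density_by_hand * binwidth)
total_area
```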

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

summary(select(diamonds, x, y, z))

##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800

ggplot(diamonds) +
  geom_histogram(mapping = aes(x=x), binwidth=0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x=y), binwidth=0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x=z), binwidth=0.01)

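# x, y and z all contain zeros, which are impossible dimensions
# y and z also have implausibly large values (58.9 mm and 31.8 mm)
# x, y and z all have spikes and troughs
# the distributions are right-skewed: most values are small, with a long tail to the right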

#Remove the outliers:
filter(diamonds, x > 0, x < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, y > 0, y < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, z > 0, z < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

# x = length, y = width, z = depth: x and y are nearly equal, and z is typically smaller (median 3.53 vs 5.70 and 5.71)
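
The suspect values can also be counted rather than read off the plots. In this sketch the 20 mm cut-off for "implausibly large" is an arbitrary assumption, and the name suspect and its column names are illustrative.

```r
library(ggplot2)  # diamonds data set
library(dplyr)    # summarise(), %>%

suspect <- diamonds %>%
  summarise(
    x_zero  = sum(x == 0),   # zero-length stones
    y_zero  = sum(y == 0),   # zero-width stones
    z_zero  = sum(z == 0),   # zero-depth stones
    y_large = sum(y > 20),   # widths above the assumed 20 mm cut-off
    z_large = sum(z > 20)    # depths above the assumed 20 mm cut-off
  )
suspect
```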

Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 10, center = 0)

ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 50, center = 0)

ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 1, center = 0)

# There is an increase in the count for prices between 500 and 1000. 
# There is also a gap around the 1500 price area.
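
The gap near 1500 can be inspected as a table as well. This is a sketch under assumed choices: the 1400 to 1600 window, the 10 dollar bands, and the 50 dollar comparison window are all arbitrary. Because count() leaves out empty bands by default, a gap shows up as missing rows.

```r
library(ggplot2)  # diamonds data set, cut_width()
library(dplyr)    # filter(), count(), %>%

# Count diamonds in 10 dollar price bands between 1400 and 1600
near_1500 <- diamonds %>%
  filter(price >= 1400, price <= 1600) %>%
  count(band = cut_width(price, 10, boundary = 1400))
near_1500

# For comparison: diamonds within 50 dollars of 1500 and of 1000
n_near_1500 <- sum(abs(diamonds$price - 1500) <= 50)
n_near_1000 <- sum(abs(diamonds$price - 1000) <= 50)
c(near_1500 = n_near_1500, near_1000 = n_near_1000)
```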

Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  xlim(c(0.95, 1.05))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 47617 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  coord_cartesian(xlim = c(0.95, 1.05))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.015) +
  xlim(c(0.95, 1.055))

## Warning: Removed 47617 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.015) +
  coord_cartesian(xlim = c(0.95, 1.055))

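# xlim() drops the rows outside the limits before binning (hence the "Removed 47617 rows" warning), so the bins are recomputed from the remaining data and a bar that crosses a limit is removed.
# coord_cartesian() bins all of the data and then zooms, so the bars are unchanged and a bar cut by the limit is drawn as a partial bar.
# With binwidth unset, the 30 default bins span only the limited range under xlim(), but the whole carat range under coord_cartesian().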
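
The difference can be confirmed from the binned data itself. The sketch below uses ggplot2's layer_data() to pull out the bins each plot computes and adds up their counts; the object names are illustrative.

```r
library(ggplot2)  # diamonds data set and plotting

base_plot <- ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.015)

# xlim() removes out-of-range rows before binning
p_lim <- base_plot + xlim(c(0.95, 1.055))
# coord_cartesian() bins everything, then zooms the view
p_zoom <- base_plot + coord_cartesian(xlim = c(0.95, 1.055))

# layer_data() returns the bins computed for a layer
n_lim <- sum(suppressWarnings(layer_data(p_lim))$count)
n_zoom <- sum(layer_data(p_zoom)$count)

c(xlim = n_lim, coord_cartesian = n_zoom, all_rows = nrow(diamonds))
```

Under xlim() the bins account for only the diamonds inside the limits, while under coord_cartesian() they account for every row.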

In geom_histogram what is the difference between binwidth and bins? When might you prefer one to another?

binwidth sets the width of each bin in the units of the x variable; bins sets the number of bins, and the width then follows from the range of the data. If both are given, binwidth takes precedence. binwidth is preferable when the units are meaningful or when several histograms need to be comparable (for example, 0.1 carat). bins is convenient for a first look at a variable whose scale is not yet known.
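
A short sketch of the two arguments side by side, again using layer_data() to inspect the bins that geom_histogram() computes. The values bins = 30 and binwidth = 0.1 are example choices, and the object names are illustrative.

```r
library(ggplot2)  # diamonds data set and plotting

# bins: fix the number of bins; the width follows from the range of carat
p_bins <- ggplot(diamonds) +
  geom_histogram(aes(x = carat), bins = 30)

# binwidth: fix the width in carats; the number of bins follows from the range
p_width <- ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.1)

d_bins <- layer_data(p_bins)
d_width <- layer_data(p_width)

# Width of each bin in carats under each argument
width_from_bins <- unique(round(d_bins$xmax - d_bins$xmin, 6))
width_from_binwidth <- unique(round(d_width$xmax - d_width$xmin, 6))

# Number of bins produced under each argument
c(bins_arg = nrow(d_bins), binwidth_arg = nrow(d_width))
```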

Tycho Gormley

7/22/2021
