Diamond Workshop

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Introduction

The objective of this workshop is to introduce you to the art of Exploratory Data Analysis (EDA).
The introduction to section 7.1 in R4DS gives a short and useful overview of what EDA is.
In this project we will be working with the diamonds data set. In the console, type ?diamonds to open a help file describing the diamonds data set and its variables.

About this document.

This is an RMarkdown document. RMarkdown is a package for literate coding. Literate coding is a way of programming in which the program instructions are interwoven with the documentation of the program.

In data science this means that we can produce one document which contains all our analytic steps, in such a way that another reader can read what we have written and also process the same data using the same software. This is a key requirement for reproducible research.

Moreover, combined with a version control system such as Git (hosted, for example, on GitHub), an RMarkdown document can be collaborative. We will talk more about that in the coming weeks.

Today’s workshop is focused on giving you an opportunity to use some of the skills you have been developing since the last class. As you work through this document, you should type in your responses to the questions and run your code in the provided code blocks.

Here is an example:

Compute the mean (arithmetic average) of the numbers from 1 to 100. (Enter your answer in the block below and run the block by clicking on the little green triangle in the upper right corner of the block.)

mean(1:100) # <- you type this

## [1] 50.5

Getting started:

If necessary, install the tidyverse.
In the console, enter View(diamonds).
In the console, type ?diamonds. This will open a help page describing diamonds. Read the help page and compare its contents with the data you see in the View pane.
Create a “data dictionary” in which you list each variable and its definition.

str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

carat (num): weight of the diamond, in carats
cut (ordered factor): quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color (ordered factor): diamond color, from D (best) to J (worst)
clarity (ordered factor): how clear the diamond is, from I1 (worst) to IF (best)
depth (num): total depth percentage, 2 * z / (x + y)
table (num): width of the top of the diamond relative to its widest point
price (int): price of the diamond in US dollars
x (num): length in mm
y (num): width in mm
z (num): depth in mm

In the code chunk below, create a summary of diamonds using the summary function.

summary(diamonds)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

How does the summary of a categorical variable differ from the summary of a quantitative variable?

The summary of a categorical variable is a frequency table (a count for each level). The summary of a quantitative variable gives the minimum, first quartile, median, mean, third quartile, and maximum.

Categorical variables: {n} frequency {N} denominator, or cohort size {p} formatted percentage

Continuous variables: {median} median {mean} mean {sd} standard deviation {var} variance {min} minimum {max} maximum {p##} any integer percentile, where ## is an integer from 0 to 100 {foo} any function of the form foo(x) is accepted where x is a numeric vector
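The difference can be seen directly by summarising one column of each kind. This is a minimal sketch; the object names cut_summary and carat_summary are chosen here for illustration only.

```r
library(ggplot2)  # provides the diamonds data set

# Categorical (ordered factor): summary() returns one count per level
cut_summary <- summary(diamonds$cut)
cut_summary

# Quantitative (numeric): summary() returns min, quartiles, median, mean, max
carat_summary <- summary(diamonds$carat)
carat_summary
```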

In the code chunk below, create a bar chart visualization of color.

diamonds %>%
  ggplot() +
  geom_bar(aes(x = color, fill = color), color = "white")

Using the dplyr function count, produce a frequency table for color in the code chunk below.

diamonds %>%
  count(color)

## # A tibble: 7 x 2
##   color     n
##   <ord> <int>
## 1 D      6775
## 2 E      9797
## 3 F      9542
## 4 G     11292
## 5 H      8304
## 6 I      5422
## 7 J      2808

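For examining the variability of a continuous numerical variable, the first choice is frequently the histogram. A histogram resembles a bar chart, with an important difference: its bars count the observations that fall in numeric intervals (bins) rather than in categories.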

Plot a histogram of carat from the diamonds data set.

diamonds %>%
  ggplot() +
  geom_histogram(aes(x = carat))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In a histogram the range of data values is divided into bins. The number of bins depends on the width of each bin. Plot the above histogram with binwidths of 0.01, 0.05, 0.1, 0.25, and 0.5. What do you observe about the resulting histograms?

diamonds %>%
  ggplot() +
  geom_histogram(aes(x = carat), binwidth = .02)
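One way to try every binwidth listed in the exercise without repeating the chunk is to build the plots in a loop. This is only a sketch; binwidths and carat_hists are names introduced here for illustration.

```r
library(tidyverse)

# Binwidths listed in the exercise
binwidths <- c(0.01, 0.05, 0.1, 0.25, 0.5)

# Build one histogram per binwidth and label each plot with its binwidth
carat_hists <- map(binwidths, function(bw) {
  ggplot(diamonds) +
    geom_histogram(aes(x = carat), binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
})

# Draw the plots one after another
walk(carat_hists, print)
```

Comparing the plots side by side makes it easier to see which features of the distribution survive as the bins get wider.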

Using the ggplot2 function cut_width() make a table of carat frequencies with binwidth = 0.75, compare your table with the corresponding histogram.
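A possible answer, offered as a sketch rather than the intended solution: cut_width() turns carat into a factor of intervals, which count() can then tabulate. The column name bin and the object name carat_table are illustrative choices.

```r
library(tidyverse)

# Frequency table of carat, grouped into intervals 0.75 carats wide
carat_table <- diamonds %>%
  count(bin = cut_width(carat, width = 0.75))
carat_table

# Histogram with the same binwidth, for comparison with the table
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.75)
```

Each row of the table should correspond to one bar of the histogram, with n giving the bar height.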

Plot a histogram with a binwidth of 0.1 but only for diamonds with carat < 2.

diamonds %>%
  filter(carat < 2) %>%
  ggplot() +
  geom_histogram(aes(x = carat), binwidth = 0.1)

Read about geom_freqpoly() and produce overlaid histograms with binwidth = 0.1 for each color. What happens if you set x = price, y = ..density.. in the aes for geom_freqpoly()?

diamonds %>%
  ggplot() +
  geom_freqpoly(aes(x = price, y = ..density.., color = color), binwidth = 400 )

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

summary(select(diamonds, x, y, z))

##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800

diamonds %>%
  ggplot() +
  geom_histogram(aes(x=x, color=cut), binwidth = 0.1)

diamonds %>%
  ggplot() +
  geom_histogram(aes(x=y, color=cut), binwidth = 0.1)

diamonds %>%
  ggplot() +
  geom_histogram(aes(x=z, color=cut), binwidth = 0.1)

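## Per ?diamonds, x is the length, y is the width, and z is the depth, all in mm.
## x and y have nearly identical quartiles and z is smaller, which fits a stone that is roughly as long as it is wide.
## The minimums of 0 and the maximums of 58.9 (y) and 31.8 (z) are physically implausible and are probably data-entry errors.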
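The summary above shows minimums of 0 for all three dimensions and maximums of 58.9 for y and 31.8 for z, far from the quartiles. A quick way to look at those rows is to filter for them. The cutoff of 20 mm is an arbitrary assumption made for this sketch, and suspicious is an illustrative name.

```r
library(tidyverse)

# Rows with a zero dimension, or with y or z above an assumed 20 mm cutoff
suspicious <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 20) %>%
  select(carat, price, x, y, z) %>%
  arrange(desc(y))
suspicious
```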

Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

diamonds %>%
  filter(price < 5000) %>%
  ggplot() +
  geom_histogram(aes(x=price), binwidth = 10)

There is a gap in the distribution around $1,500, where the histogram shows almost no diamonds.
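The histogram can be checked against the raw counts by tabulating the prices close to $1,500. This is a sketch; the window of $1,450 to $1,550 is an assumption picked to bracket the region of interest, and near_1500 is an illustrative name.

```r
library(tidyverse)

# Count diamonds at each distinct price in an assumed window around $1,500
near_1500 <- diamonds %>%
  filter(between(price, 1450, 1550)) %>%
  count(price)

# Print every row so that any run of missing prices is visible
print(near_1500, n = Inf)
```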

Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

diamonds %>%
 ## filter(price < 5000) %>%
  ggplot() +
  geom_histogram(aes(x=price), binwidth = 10) +
  coord_cartesian(xlim = c(1400,1600), ylim = c(0,200))

Using coord_cartesian() you can zoom in on a desired region by giving x and y limits. The histogram is still computed from all of the data; only the view is restricted.

diamonds %>%
 ## filter(price < 5000) %>%
  ggplot() +
  geom_histogram(aes(x=price), binwidth = 10) +
  xlim(1400, 1600) +
  ylim(0,200)

## Warning: Removed 52871 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

You can restrict the plot to the same region with xlim() and ylim(), but these remove the observations outside the limits before the bins are computed, as the first warning shows (52,871 rows removed).
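The difference between the two approaches can also be checked numerically by pulling the computed bins out of each plot with layer_data(). This is a sketch; base, zoomed, clipped, bins_zoomed, and bins_clipped are illustrative names, and the y limits are left out to keep the comparison simple.

```r
library(tidyverse)

base <- ggplot(diamonds) +
  geom_histogram(aes(x = price), binwidth = 10)

zoomed  <- base + coord_cartesian(xlim = c(1400, 1600))  # restricts the view only
clipped <- base + xlim(1400, 1600)                       # restricts the data

# layer_data() returns the bins computed for the first layer of a plot
bins_zoomed  <- layer_data(zoomed)
bins_clipped <- layer_data(clipped)  # warns about removed rows

sum(bins_zoomed$count)   # every diamond is still binned
sum(bins_clipped$count)  # only diamonds priced inside the x limits
```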

In geom_histogram what is the difference between binwidth and bins? When might you prefer one to another?

diamonds %>%
  ggplot() +
  geom_histogram(bins = 10, aes(x=price))

diamonds %>%
  ggplot() +
  geom_histogram(aes(x=price), binwidth = 10)

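bins sets the number of bins (bars) in the histogram, whereas binwidth sets the width of each bin in the units of the x variable. If both are given, binwidth takes precedence. binwidth is preferable when the units are meaningful (for example, 10-dollar price bands) or when histograms need to be comparable across data sets, whereas bins is convenient for a quick look when you do not yet know the scale of the variable.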
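To see how differently the two arguments partition the same variable, the number of computed bins can be counted with layer_data(). This is a sketch; by_bins, by_width, n_by_bins, and n_by_width are illustrative names.

```r
library(tidyverse)

by_bins  <- ggplot(diamonds) + geom_histogram(aes(x = price), bins = 10)
by_width <- ggplot(diamonds) + geom_histogram(aes(x = price), binwidth = 10)

# One row of layer data per bin
n_by_bins  <- nrow(layer_data(by_bins))
n_by_width <- nrow(layer_data(by_width))

n_by_bins   # a handful of wide bins
n_by_width  # many bins, each 10 dollars wide
```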