library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The objective of this workshop is to introduce you to the art of Exploratory Data Analysis (EDA).
The introduction to section 7.1 in R4DS gives a short and useful overview of what EDA is.
In this project we will be working with the diamonds data set. In the console, type ?diamonds to open a help file describing the diamonds data set and its variables.
This is an RMarkdown document. RMarkdown is a package for literate coding. Literate coding is a way of programming in which the program instructions are interwoven with the documentation of the program.
In data science this means that we can produce one document that contains all our analytic steps, so that another reader can read what we have written and also process the same data using the same software. This is a key requirement for reproducible research.
Moreover, combined with a version control system such as Git (hosted, for example, on GitHub), an RMarkdown document can be collaborative. We will talk more about that in the coming weeks.
Today’s workshop is focused on giving you an opportunity to use some of the skills you have been developing since the last class. As you work through this document, you should type in your responses to the questions and run your code in the provided code blocks.
Compute the mean (arithmetic average) of the numbers from 1 to 100. (Enter your answer in the block below and run the block by clicking on the little green triangle in the upper right corner of the block.)
mean(1:100) # <- you type this
## [1] 50.5
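As a quick check on the result above, the arithmetic mean is the sum of the values divided by their count, and for the integers 1 to n it reduces to (n + 1) / 2. A minimal sketch in base R; the variable names are illustrative only:

```r
# The arithmetic mean is the sum of the values divided by how many there are.
x <- 1:100
manual_mean <- sum(x) / length(x)

# For the integers 1..n the sum is n * (n + 1) / 2, so the mean is (n + 1) / 2.
n <- length(x)
closed_form <- (n + 1) / 2

# Both agree with the built-in mean().
manual_mean == mean(x)
closed_form == mean(x)
```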
If necessary install the tidyverse.
In the console enter View(diamonds)
In the console type ?diamonds; this will open a help page describing diamonds. Read the help page and compare its contents with the data you see in the View pane.
Create a “data dictionary” in which you list each variable and its definition.
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
carat (num): weight of the diamond in carats
cut (ordered factor): quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color (ordered factor): diamond color, from D (best) to J (worst)
clarity (ordered factor): how clear the diamond is, from I1 (worst) to IF (best)
depth (num): total depth percentage, z / mean(x, y) = 2 * z / (x + y)
table (num): width of the top of the diamond relative to its widest point
price (int): price in US dollars
x (num): length in mm
y (num): width in mm
z (num): depth in mm
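The help page for diamonds defines depth as a total depth percentage, z / mean(x, y) = 2 * z / (x + y). One way to check a data dictionary entry like this is to recompute it from the other columns. The sketch below assumes the percentage is that ratio multiplied by 100; `depth_calc` and `diff` are made-up column names for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Recompute depth from the three dimensions and compare with the recorded value.
depth_check <- diamonds %>%
  filter(x > 0, y > 0, z > 0) %>%              # skip rows with a zero dimension
  mutate(depth_calc = 100 * 2 * z / (x + y),   # assumed formula, see lead-in
         diff = abs(depth - depth_calc))

# Typical size of the disagreement between recorded and recomputed depth
summary(depth_check$diff)
```

If the definition is right, most differences should be small and attributable to rounding; large differences point to rows worth inspecting.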
Summarize diamonds using the summary function.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
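For categorical variables (factors), summary() gives a frequency table of the levels. For quantitative variables it gives the minimum, first quartile, median, mean, third quartile, and maximum. It does not report the mode.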
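To see how summary() treats the two kinds of columns, it can help to call it on one column of each type. This is a small sketch; the object names are illustrative, and the tabulation at the end is just one possible way to find a most common value, since summary() itself does not return one.

```r
library(ggplot2)  # provides the diamonds data set

# Factor column: summary() returns the count in each level.
cut_summary <- summary(diamonds$cut)
cut_summary

# Numeric column: summary() returns min, quartiles, median, mean, and max.
carat_summary <- summary(diamonds$carat)
names(carat_summary)

# Most common carat value, found by tabulating and taking the largest count.
carat_mode <- names(which.max(table(diamonds$carat)))
carat_mode
```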
Categorical variables: {n} frequency; {N} denominator, or cohort size; {p} formatted percentage.
Continuous variables: {median} median; {mean} mean; {sd} standard deviation; {var} variance; {min} minimum; {max} maximum; {p##} any integer percentile, where ## is an integer from 0 to 100; {foo} any function of the form foo(x), where x is a numeric vector.
Make a bar chart of color.
diamonds %>%
ggplot() +
geom_bar(aes(x = color, fill = color), color = "white")
Use the dplyr function count to produce a frequency table for color in the code chunk below.
diamonds %>%
count(color)
## # A tibble: 7 x 2
## color n
## <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
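A frequency table is often easier to read with proportions next to the counts. The sketch below extends the count() call with two illustrative columns, `prop` and `pct`, which are names chosen here and not part of the original exercise.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Counts per color, plus each color's share of all diamonds.
color_freq <- diamonds %>%
  count(color) %>%
  mutate(prop = n / sum(n),            # proportion of all rows
         pct = round(100 * prop, 1))   # same value as a rounded percentage
color_freq
```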
Make a histogram of carat from the diamonds data set.
diamonds %>%
ggplot() +
geom_histogram(aes(x = carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Redo the histogram with a binwidth of 0.01, 0.05, 0.1, 0.25, and 0.5. What do you observe about the resulting histograms?
diamonds %>%
ggplot() +
geom_histogram(aes(x = carat), binwidth = .02)
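To compare several binwidths without retyping the plot, one option is to build the histograms in a loop. This is a sketch only; `binwidths` and `carat_hists` are illustrative names, and the values are the ones listed in the question.

```r
library(ggplot2)

binwidths <- c(0.01, 0.05, 0.1, 0.25, 0.5)

# Build one histogram per binwidth and keep them in a named list.
carat_hists <- lapply(binwidths, function(bw) {
  ggplot(diamonds) +
    geom_histogram(aes(x = carat), binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
})
names(carat_hists) <- binwidths

carat_hists[["0.01"]]  # finest bins
carat_hists[["0.5"]]   # coarsest bins
```

Narrow bins tend to reveal fine structure, such as spikes at particular carat values, while wide bins show only the overall shape.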
Using cut_width(), make a table of carat frequencies with binwidth = 0.75, and compare your table with the corresponding histogram.
diamonds %>%
ggplot() +
geom_histogram(aes(x = carat), binwidth = .01)
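The frequency table asked for in this exercise can be sketched with count() and cut_width(), followed by a histogram using the same width so the two can be compared. `carat_table` and `carat_bin` are names chosen here for illustration.

```r
library(dplyr)
library(ggplot2)  # provides cut_width() and the diamonds data set

# Frequency table: cut carat into intervals of width 0.75, then count rows.
carat_table <- diamonds %>%
  count(carat_bin = cut_width(carat, width = 0.75))
carat_table

# Histogram with the same width; bar heights should line up with the table.
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.75)
```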
Use geom_freqpoly() to produce overlaid histograms with binwidth = 0.1 for each color. What happens if you set x = price, y = ..density.. in the aes for geom_freqpoly()?
diamonds %>%
ggplot() +
geom_freqpoly(aes(x = price, y = ..density.., color = color), binwidth = 400 )
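Mapping y to density rescales each line so that the area under it is 1, which makes the shapes comparable across colors with very different counts. The sketch below uses after_stat(density), the newer spelling of ..density.. available since ggplot2 3.3.0, and then checks the areas numerically. `price_density`, `computed`, and `area_by_color` are illustrative names.

```r
library(ggplot2)

# Same plot as above, written with after_stat() in place of ..density..
price_density <- ggplot(diamonds) +
  geom_freqpoly(aes(x = price, y = after_stat(density), color = color),
                binwidth = 400)
price_density

# Pull out the values ggplot2 computed for the layer.
computed <- layer_data(price_density)

# Area under each line: density times bin width, summed within each color group.
area_by_color <- tapply(computed$density * 400, computed$group, sum)
area_by_color
```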
summary(select(diamonds, x, y, z))
## x y z
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.700 Median : 5.710 Median : 3.530
## Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :10.740 Max. :58.900 Max. :31.800
diamonds %>%
ggplot() +
geom_histogram(aes(x=x, color=cut), binwidth = 0.1)
diamonds %>%
ggplot() +
geom_histogram(aes(x=y, color=cut), binwidth = 0.1)
diamonds %>%
ggplot() +
geom_histogram(aes(x=z, color=cut), binwidth = 0.1)
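## The x and y distributions are nearly identical (means 5.731 and 5.735) and z is smaller, so z is the depth. According to ?diamonds, x is the length and y is the width. The zero minimums and the maxima of 58.9 (y) and 31.8 (z) are implausible for measurements in mm and are probably data entry errors.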
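The summary of x, y, and z shows minimums of 0 and maximums far above the third quartile, so it can be useful to list the rows responsible. This is a sketch; the cutoff of 20 mm is an arbitrary assumption chosen to isolate the extreme values, and `unusual` is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Rows with a zero dimension or an implausibly large y or z.
unusual <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 20) %>%  # 20 is an assumed cutoff
  select(carat, price, x, y, z) %>%
  arrange(desc(y))
unusual
```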
diamonds %>%
filter(price < 5000) %>%
ggplot() +
geom_histogram(aes(x=price), binwidth = 10)
diamonds %>%
## filter(price < 5000) %>%
ggplot() +
geom_histogram(aes(x=price), binwidth = 10) +
coord_cartesian(xlim = c(1400,1600), ylim = c(0,200))
diamonds %>%
## filter(price < 5000) %>%
ggplot() +
geom_histogram(aes(x=price), binwidth = 10) +
xlim(1400, 1600) +
ylim(0,200)
## Warning: Removed 52871 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
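coord_cartesian() zooms in on part of the plot without dropping any data, so the bins are computed from all rows and the bars at the edges of the window are drawn in full. xlim() and ylim() also restrict the view, but they remove values outside the limits before the bins are computed, which is why the warnings above report removed rows.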
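The difference between the two zooming approaches can be seen in the data each plot computes. The sketch below counts the bars behind each version; `base_hist`, `zoomed`, and `clipped` are illustrative names, and suppressWarnings() is used only to silence the removed-rows warnings shown earlier.

```r
library(ggplot2)

base_hist <- ggplot(diamonds) +
  geom_histogram(aes(x = price), binwidth = 10)

# Zoom with the coordinate system versus with scale limits.
zoomed  <- base_hist + coord_cartesian(xlim = c(1400, 1600), ylim = c(0, 200))
clipped <- base_hist + xlim(1400, 1600) + ylim(0, 200)

# Number of bars each plot computes before drawing.
n_bars_zoomed  <- nrow(layer_data(zoomed))
n_bars_clipped <- nrow(suppressWarnings(layer_data(clipped)))
c(zoomed = n_bars_zoomed, clipped = n_bars_clipped)

# Rows whose price falls outside the x limits; compare with the warning.
in_window <- sum(diamonds$price >= 1400 & diamonds$price <= 1600)
nrow(diamonds) - in_window
```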
In geom_histogram, what is the difference between binwidth and bins? When might you prefer one to another?
diamonds %>%
ggplot() +
geom_histogram(bins = 10, aes(x=price))
diamonds %>%
ggplot() +
geom_histogram(aes(x=price), binwidth = 10)
‘bins’ is the number of rectangles in the histogram, whereas ‘binwidth’ is the width of each rectangle in the units of the x variable. Use ‘binwidth’ when the units are meaningful and you want to control the resolution directly (for example, 10 dollars per bin); use ‘bins’ when you want a fixed number of rectangles regardless of the range of the data. If both are given, ‘binwidth’ overrides ‘bins’.
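The two arguments can be compared by looking at the bars ggplot2 computes for each. In this sketch the binwidth of 500 is an arbitrary illustrative value, and the object names are made up for the example.

```r
library(ggplot2)

by_bins  <- ggplot(diamonds) + geom_histogram(aes(x = price), bins = 10)
by_width <- ggplot(diamonds) + geom_histogram(aes(x = price), binwidth = 500)

bars_bins  <- layer_data(by_bins)
bars_width <- layer_data(by_width)

# With bins, the number of bars is fixed and the width follows from the data range.
nrow(bars_bins)
unique(round(bars_bins$xmax - bars_bins$xmin))

# With binwidth, the width is fixed and the number of bars follows from the range.
nrow(bars_width)
unique(round(bars_width$xmax - bars_width$xmin))
```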