The objective of this workshop is to introduce you to the art of Exploratory Data Analysis (EDA).
The introduction to section 7.1 in R4DS gives a short and useful overview of what EDA is.
In this project we will be working with the diamonds data set. In the console, type ?diamonds
to open a help file describing the diamonds data set and its variables.
This is an RMarkdown document. RMarkdown is a package for literate coding. Literate coding is a way of programming in which the program instructions are interwoven with the documentation of the program.
In data science this means that we can produce one document which contains all our analytic steps, in such a way that another reader can read what you have written and also process the same data using the same software. This is a key requirement for reproducible research.
Moreover, combined with a version control system like Git (hosted on a service such as GitHub), an RMarkdown document can be collaborative. We will talk more about that in the coming weeks.
Today’s workshop is focused on giving you an opportunity to use some of the skills you have been developing since the last class. As you work through this document, you should type in your responses to the questions and run your code in the provided code blocks.
Compute the mean (arithmetic average) of the numbers from 1 to 100. (Enter your answer in the block below and run the block by clicking on the little green triangle in the upper right corner of the block.)
mean(1:100) # <- you type this
## [1] 50.5
If necessary, install the tidyverse.
In the console, enter View(diamonds).
In the console, type ?diamonds; this will open a help page describing diamonds. Read the help page and compare its contents with the data you see in the View pane.
Create a “data dictionary” in which you list each variable and its definition.
price: price in US dollars ($326–$18,823)
carat: weight of the diamond (0.2–5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond colour, from D (best) to J (worst)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0–10.74)
y: width in mm (0–58.9)
z: depth in mm (0–31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table: width of top of diamond relative to widest point (43–95)
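The depth definition above can be checked against the data. The following is a minimal sketch, assuming ggplot2 is installed (it supplies diamonds) and assuming the percentage is the ratio multiplied by 100, which the help page formula leaves implicit but the reported range of 43–79 suggests. The names recomputed and usable are introduced here for illustration only.

```r
library(ggplot2)  # provides the diamonds data set

# Recompute depth from the dimensions. The factor of 100 is an assumption
# inferred from the reported range (43 to 79); the help formula omits it.
recomputed <- with(diamonds, 100 * 2 * z / (x + y))

# Rows where x + y is 0 produce NaN or Inf, so keep only finite values.
usable <- is.finite(recomputed)

# Distribution of the absolute difference from the recorded depth.
summary(abs(recomputed[usable] - diamonds$depth[usable]))
```

Large differences in the upper tail of this summary would point to rows with implausible dimensions, such as the zeros in x, y, and z noted in the dictionary.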
Summarize diamonds using the summary function.
summary(diamonds)
##      carat               cut        color        clarity          depth
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
##                                     J: 2808   (Other): 2531
##      table           price             x                y
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
##
##        z
##  Min.   : 0.000
##  1st Qu.: 2.910
##  Median : 3.530
##  Mean   : 3.539
##  3rd Qu.: 4.040
##  Max.   :31.800
##
Summary of a categorical variable like ‘cut’ shows the frequency of the various levels. Summary of a quantitative variable like ‘depth’ shows the basic statistics of the variable: minimum, 1st quartile, median, mean, 3rd quartile and maximum values.
Produce a bar chart of color.
diamonds %>% ggplot() +
geom_bar(aes(x=color, fill=color), color='white')
Using the dplyr function count, produce a frequency table for color in the code chunk below.
diamonds %>% count(color)
## # A tibble: 7 × 2
##   color     n
##   <ord> <int>
## 1 D      6775
## 2 E      9797
## 3 F      9542
## 4 G     11292
## 5 H      8304
## 6 I      5422
## 7 J      2808
Produce a histogram of carat from the diamonds data set.
diamonds %>% ggplot() +
geom_histogram(aes(x=carat), binwidth = .1)
Repeat the histogram with a binwidth of 0.01, 0.05, 0.1, 0.25, and 0.5. What do you observe about the resulting histograms?
diamonds %>% ggplot() +
geom_histogram(aes(x=carat), binwidth = .01)
diamonds %>% ggplot() +
geom_histogram(aes(x=carat), binwidth = .05)
diamonds %>% ggplot() +
geom_histogram(aes(x=carat), binwidth = .1)
diamonds %>% ggplot() +
geom_histogram(aes(x=carat), binwidth = .25)
diamonds %>% ggplot() +
geom_histogram(aes(x=carat), binwidth = .5)
# effect of binwidth: as the width decreases the count of diamonds in each bin decreases but the resolution or level of detail increases.
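The five near-identical chunks above can also be generated in a loop. This is only a sketch of one way to do it, assuming ggplot2 is loaded; the names binwidths and plots are introduced here for illustration.

```r
library(ggplot2)

# The bin widths requested in the exercise.
binwidths <- c(0.01, 0.05, 0.1, 0.25, 0.5)

# Build one histogram of carat per bin width, labelled with its width.
plots <- lapply(binwidths, function(bw) {
  ggplot(diamonds) +
    geom_histogram(aes(x = carat), binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
})

# Inside a loop, ggplot objects must be printed explicitly to be drawn.
for (p in plots) print(p)
```

Keeping the widths in one vector makes it easy to try other values without copying the plotting code.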
Using cut_width(), make a table of carat frequencies with binwidth = 0.75, and compare your table with the corresponding histogram.
diamonds %>% ggplot() +
geom_histogram(aes(x=carat), binwidth = .75)
table(cut_width(x=diamonds$carat,width=.75))
##
## [-0.375,0.375]   (0.375,1.12]    (1.12,1.88]    (1.88,2.62]    (2.62,3.38]
##          12024          30983           8722           2147             53
##    (3.38,4.12]    (4.12,4.88]    (4.88,5.62]
##              8              2              1
# As seen, there is a direct correspondence between the histogram and the cut_width() table: each bar height (graphical) equals one table count (numerical).
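The same frequency table can be produced in the dplyr style used elsewhere in this document. A minimal sketch, assuming ggplot2 and dplyr are installed; the column name bin and the object name carat_bins are chosen here for illustration.

```r
library(ggplot2)  # diamonds and cut_width()
library(dplyr)    # count() and %>%

# Count diamonds per 0.75-carat bin; each row corresponds to one histogram bar.
carat_bins <- diamonds %>%
  count(bin = cut_width(carat, width = 0.75))

carat_bins
```

The first bin starts at -0.375 because, by default, the bins are centered on multiples of the width. Passing boundary = 0 to both cut_width() and geom_histogram() should align the bin edges at 0 instead, which may be easier to read for a variable that cannot be negative.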
diamonds %>% filter(carat<2) %>% ggplot() +
geom_histogram(aes(x=carat), binwidth = .1)
Use geom_freqpoly() to produce overlaid histograms with binwidth = 0.1 for each color. What happens if you set x = price, y = ..density.. in the aes for geom_freqpoly()?
diamonds %>% ggplot() +
#geom_histogram(aes(x=price)) +
geom_freqpoly(aes(x=price, y=..density.., color=color))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
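# With y = ..density.. each curve is scaled to unit area, so colors with different counts can be compared directly. The better colors (D, E, F) have their density concentrated at low prices, while the upper price range is relatively more populated by the poorer colors (H, I, J).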
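Two details of the chunk above are worth a hedged variant: the message about bins = 30 shows that no bin width was chosen, and the ..density.. notation is deprecated in recent ggplot2 releases in favour of after_stat(). The sketch below assumes ggplot2 3.3.0 or later; the binwidth of 500 (US dollars) is an arbitrary value picked for illustration, not one from the workshop.

```r
library(ggplot2)

# Density-scaled frequency polygons of price, one line per color.
# after_stat(density) replaces the older ..density.. notation.
# binwidth = 500 is an illustrative assumption; try other values.
price_density <- ggplot(diamonds) +
  geom_freqpoly(
    aes(x = price, y = after_stat(density), color = color),
    binwidth = 500
  )

price_density
```

Setting binwidth explicitly silences the stat_bin() message and makes the plot reproducible if the default number of bins ever changes.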
diamonds %>% ggplot() +
geom_bar(aes(x=x, fill=color)) +
coord_cartesian(xlim= c(0,10))
diamonds %>% ggplot() +
geom_bar(aes(x=y, fill=color)) +
coord_cartesian(xlim= c(0,10))
diamonds %>% ggplot() +
geom_bar(aes(x=z, fill=color)) +
coord_cartesian(xlim= c(0,10))
# Examination of the results suggests that x and y are the length and width, of similar dimensions, and that z is the depth, at roughly 60% of the length or width (consistent with the mean depth percentage of 61.75).
# another observation is that the spikes in frequency indicate that certain sizes predominate, forming about 5 to 6 groups around those dominant sizes. This is probably because diamonds are artificially cut into those 5 or 6 sizes. It is hard to tell whether quality (color) varies with sizes; in natural uncut diamonds it does vary.
diamonds %>% ggplot() +
geom_histogram(aes(x=price, fill=color), binwidth=80) +
coord_cartesian(ylim= c(0,2000))
# Price is not only a function of quality (color). The count of diamonds falls off steeply, roughly exponentially, as price increases.
diamonds %>% ggplot() +
geom_histogram(aes(x=price, fill=color), binwidth=80) +
ylim(c(0,2000))
## Warning: Removed 4 rows containing missing values (geom_bar).
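The warning above comes from the difference between the two ways of restricting the y axis. The sketch below puts them side by side; the object names base_plot, zoomed, and limited are introduced here for illustration.

```r
library(ggplot2)

base_plot <- ggplot(diamonds) +
  geom_histogram(aes(x = price, fill = color), binwidth = 80)

# coord_cartesian() only zooms the view: every bar is still computed and
# drawn, and bars taller than 2000 are simply cut off at the top edge.
zoomed <- base_plot + coord_cartesian(ylim = c(0, 2000))

# ylim() sets the limits of the y scale, like
# scale_y_continuous(limits = c(0, 2000)): bar segments that reach above
# 2000 become missing values and are dropped with a warning.
limited <- base_plot + ylim(c(0, 2000))
```

Printing zoomed and limited one after the other shows the effect: the tallest stacks are truncated in the first plot and partly missing in the second.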
In geom_histogram, what is the difference between binwidth and bins? When might you prefer one to another?
diamonds %>% ggplot() +
geom_histogram(aes(x=price, fill=color), bins=80) +
ylim(c(0,2000))
## Warning: Removed 16 rows containing missing values (geom_bar).
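# 'bins' fixes the number of bins and lets ggplot2 derive the width from the data range. It is preferred when the range of the x variable is unknown or may differ between runs, because the resolution of the plot stays uniform.
# 'binwidth' fixes the width of each bin in the units of x. It is preferred when the range of the variable is known and a meaningful width can be chosen.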
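A little arithmetic shows how differently the value 80 behaves in the two arguments. This is a rough sketch that ignores how ggplot2 places the outermost bin edges, so actual counts may differ by a bin; the variable names are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Span of the price variable (326 to 18823 in the summary above).
price_span <- diff(range(diamonds$price))

# binwidth = 80 means bins 80 dollars wide, so roughly this many bins.
approx_bins_for_width_80 <- ceiling(price_span / 80)

# bins = 80 means 80 bins, so each is roughly this many dollars wide.
approx_width_for_80_bins <- price_span / 80

approx_bins_for_width_80
approx_width_for_80_bins
```

This also suggests why the two ylim() plots above warn about a different number of removed rows: with bins = 80 each bin is much wider, so more stacked segments rise above the limit of 2000.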