The objective of this workshop is to introduce you to the art of Exploratory Data Analysis (EDA).
The introduction to section 7.1 in R4DS gives a short and useful overview of what EDA is.
In this project we will be working with the diamonds data set. In the console, type ?diamonds to open a help file describing the diamonds data set and its variables.
This is an RMarkdown document. RMarkdown is a package for literate coding. Literate coding is a way of programming in which the program instructions are interwoven with the documentation of the program.
In data science this means that we can produce one document that contains all our analytic steps, so that another reader can both read what we have written and process the same data using the same software. This is a key requirement for reproducible research.
Moreover, combined with a version control system such as Git (with a hosting service like GitHub), an RMarkdown document can be collaborative. We will talk more about that in the coming weeks.
Today’s workshop is focused on giving you an opportunity to use some of the skills you have been developing since the last class. As you work through this document, you should type in your responses to the questions and run your code in the provided code blocks.
Compute the mean (arithmetic average) of the numbers from 1 to 100. (Enter your answer in the block below and run the block by clicking on the little green triangle in the upper right corner of the block.)
mean(1:100) # <- you type this
## [1] 50.5
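As a quick cross-check, the arithmetic mean is simply the sum of the values divided by how many values there are. The sketch below computes it by hand; the variable names are illustrative and not part of the workshop.

```r
# Arithmetic mean by hand: sum of the values divided by their count
values <- 1:100
manual_mean <- sum(values) / length(values)

manual_mean                     # 50.5, the same value mean(1:100) returns
manual_mean == mean(values)     # TRUE
```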
tidyverse.
# install.packages("tidyverse")  # run once if the tidyverse is not installed yet
library(tidyverse)
View(diamonds)
view(diamonds)
?diamonds opens a help page describing diamonds. Read the help page and compare its contents with the data you see in the View pane.
?diamonds
## starting httpd help server ... done
price: price in US dollars ($326–$18,823)
carat: weight of the diamond (0.2–5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond color, from D (best) to J (worst)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0–10.74)
y: width in mm (0–58.9)
z: depth in mm (0–31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table: width of top of diamond relative to widest point (43–95)
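The depth definition above can be checked against the data. This is a minimal sketch, assuming the percentage is obtained by multiplying the ratio by 100; `depth_check` and `depth_calc` are names made up for this illustration.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Recompute depth from the dimensions, skipping rows where x or y is recorded as 0
depth_check <- diamonds %>%
  filter(x > 0, y > 0) %>%
  mutate(depth_calc = 100 * 2 * z / (x + y)) %>%  # times 100: depth is a percentage
  select(x, y, z, depth, depth_calc)

# Compare the recorded depth with the recomputed one for the first few diamonds
head(depth_check)
```

Small differences between `depth` and `depth_calc` are to be expected, since x, y, and z are stored rounded to two decimals.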
diamonds using the summary function.
summary(diamonds)
##      carat               cut        color        clarity          depth
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
##                                     J: 2808   (Other): 2531
##      table           price             x                y
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
##
##        z
##  Min.   : 0.000
##  1st Qu.: 2.910
##  Median : 3.530
##  Mean   : 3.539
##  3rd Qu.: 4.040
##  Max.   :31.800
##
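A categorical variable takes one of a limited set of labels, and each label places the observation in a group (for example cut, color, and clarity). A quantitative variable takes numerical values that represent a measurement or a count (for example carat and price).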
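One way to see which diamonds variables fall into each group is to look at the class of every column. The sketch below is only an illustration; treating factor-like classes as categorical and numeric-like classes as quantitative is a working assumption, and the names `var_class`, `categorical`, and `quantitative` are made up here.

```r
library(ggplot2)  # provides the diamonds data set

# First class of each column: "ordered" marks an ordered factor,
# "numeric" and "integer" mark number-valued columns
var_class <- sapply(diamonds, function(col) class(col)[1])

categorical  <- names(var_class)[var_class %in% c("factor", "ordered", "character")]
quantitative <- names(var_class)[var_class %in% c("numeric", "integer")]

categorical    # cut, color, clarity
quantitative   # carat, depth, table, price, x, y, z
```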
color
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color))
Use the dplyr function count to produce a frequency table for color in the code chunk below.
diamonds %>%
  count(color)
## # A tibble: 7 x 2
## color n
## <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
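A frequency table is often easier to compare across groups when the counts are also expressed as proportions. The following sketch extends the count above; `color_freq` and `prop` are illustrative names, not part of the workshop.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Counts per color plus each color's share of all diamonds
color_freq <- diamonds %>%
  count(color) %>%
  mutate(prop = n / sum(n))

color_freq
```

The `n` column should reproduce the table above, and the `prop` column sums to 1.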
carat from the diamonds data set.
ggplot(diamonds) +
  geom_histogram(aes(x = carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
binwidth of 0.01, 0.05, 0.1, 0.25, 0.5. What do you observe about the resulting histograms?
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01)
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.05)
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.1)
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.25)
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.5)
# As binwidth increases, the bars get wider because the data points are grouped into fewer bins.
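To make the relationship between bin width and the number of bars concrete, the sketch below estimates how many bins are needed to cover the range of carat for each width tried above. It is a rough estimate only: ggplot2 chooses its own bin edges, so the exact number of bars it draws may differ slightly. The names `carat_span` and `approx_bins` are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

binwidths  <- c(0.01, 0.05, 0.1, 0.25, 0.5)
carat_span <- diff(range(diamonds$carat))        # max(carat) - min(carat)

# Rough number of bins needed to cover the range at each width
approx_bins <- ceiling(carat_span / binwidths)

data.frame(binwidth = binwidths, approx_bins = approx_bins)
```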
Using cut_width(), make a table of carat frequencies with binwidth = 0.75, and compare your table with the corresponding histogram.
diamonds %>%
  count(cut_width(carat, 0.75))
## # A tibble: 8 x 2
## `cut_width(carat, 0.75)` n
## <fct> <int>
## 1 [-0.375,0.375] 12024
## 2 (0.375,1.12] 30983
## 3 (1.12,1.88] 8722
## 4 (1.88,2.62] 2147
## 5 (2.62,3.38] 53
## 6 (3.38,4.12] 8
## 7 (4.12,4.88] 2
## 8 (4.88,5.62] 1
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.75)
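The comparison between the table and the histogram can also be made numerically rather than by eye. The sketch below pulls the bar heights out of the plot with `layer_data()` and sets them next to the `cut_width()` counts. It assumes both use the same default bin edges, which appears to be the case here but is not guaranteed across ggplot2 versions; the object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Frequency table, as in the chunk above
carat_table <- diamonds %>%
  count(cut_width(carat, 0.75))

# Data behind the histogram: one row per bar, with xmin, xmax, and count
p <- ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.75)
hist_data <- layer_data(p, 1)

table_counts <- carat_table$n
hist_counts  <- hist_data$count[hist_data$count > 0]  # ignore any empty bars

# Side-by-side check; identical vectors mean the table and the plot agree
table_counts
hist_counts
```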
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.1) +
  coord_cartesian(xlim = c(0, 2))
Use geom_freqpoly() to produce overlaid histograms with binwidth = 0.1 for each color. What happens if you set x = price, y = ..density.. in the aes for geom_freqpoly()?
ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = price, color = color), binwidth = .1, boundary = 0)
ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = price, y = ..density.., color = color), binwidth = .1, boundary = 0)
# With y = ..density.. each polygon is rescaled so that the area under it is 1, which makes colors with very different counts comparable.
# Mapping y to a data column such as depth raises an error instead, because stat_bin() accepts only an x or a y aesthetic.
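What a density scale means can be shown with a short calculation: each bin's density is its count divided by the total count times the bin width, so the bars together cover an area of 1. The sketch below uses base R's `hist()` purely for illustration (it is not what ggplot2 calls internally), and the choice of color D and roughly 30 breaks is arbitrary.

```r
library(ggplot2)  # provides the diamonds data set

# Prices of the diamonds with color D, binned with base R
price_d <- diamonds$price[diamonds$color == "D"]
h <- hist(price_d, breaks = 30, plot = FALSE)

# Density by hand: count / (total count * bin width)
manual_density <- h$counts / (sum(h$counts) * diff(h$breaks))

all.equal(manual_density, h$density)     # TRUE: matches hist()'s own density
sum(manual_density * diff(h$breaks))     # total area is 1
```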
summary(select(diamonds, x, y, z))
##        x                y                z
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.700   Median : 5.710   Median : 3.530
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :10.740   Max.   :58.900   Max.   :31.800
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01)
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01)
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01)
# y and z have very large outliers (maxima of 58.9 and 31.8), and x, y and z all include values of 0
# x, y and z all have spikes and troughs
# they are right-skewed: most values sit at the low end, with a long tail to the right
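Before removing unusual values it can help to list them and look at the affected rows. This is a minimal sketch; the cut-off of 20 mm is an assumption chosen for illustration (it lies far above the third quartiles shown in the summary), and `suspect` is a made-up name.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Rows with an impossible dimension of 0 or an implausibly large y or z
# (the 20 mm threshold is an illustrative assumption)
suspect <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 20) %>%
  select(carat, price, x, y, z) %>%
  arrange(desc(y), desc(z))

suspect
```

Looking at carat and price next to the dimensions gives a hint as to whether a value is a data-entry error or a genuinely unusual diamond.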
# Remove the outliers:
filter(diamonds, x > 0, x < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)
filter(diamonds, y > 0, y < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)
filter(diamonds, z > 0, z < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)
# x = length, y = width, z = depth
ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 10, center = 0)
ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 50, center = 0)
ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 1, center = 0)
# There is an increase in the count for prices between 500 and 1000.
# There is also a gap around the 1500 price area.
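The gap seen in the histograms can be inspected as a table by counting diamonds in narrow price bins around 1500. The sketch below is illustrative only: the window from 1300 to 1700 and the bin width of 50 are arbitrary choices, and `price_bins` is a made-up name.

```r
library(ggplot2)  # provides the diamonds data set and cut_width()
library(dplyr)

# Counts in 50-dollar bins around the gap; .drop = FALSE keeps empty bins visible
price_bins <- diamonds %>%
  filter(price >= 1300, price < 1700) %>%
  count(price_bin = cut_width(price, width = 50, boundary = 1300), .drop = FALSE)

price_bins
```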
ggplot(diamonds) +
  geom_histogram(aes(x = carat)) +
  xlim(c(0.95, 1.05))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 47617 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
ggplot(diamonds) +
  geom_histogram(aes(x = carat)) +
  coord_cartesian(xlim = c(0.95, 1.05))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.015) +
  xlim(c(0.95, 1.055))
## Warning: Removed 47617 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.015) +
  coord_cartesian(xlim = c(0.95, 1.055))
# xlim() discards the values outside the limits before the bins are computed (hence the warnings), so bars that cross a limit are dropped.
# coord_cartesian() bins all the data first and then zooms in, so a bar that extends past the limit is still drawn, cut off at the edge.
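The number of rows that fall outside the limits can be counted directly, which helps in reading the "Removed ... rows" warnings above. This sketch only counts values outside the interval; how ggplot2 treats values exactly on a limit is its own business, so the result should be read as an approximation of what the warning reports. The names `lims`, `n_outside`, and `n_inside` are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

lims <- c(0.95, 1.05)

# Diamonds whose carat lies outside the plotting limits
n_outside <- sum(diamonds$carat < lims[1] | diamonds$carat > lims[2])
n_inside  <- nrow(diamonds) - n_outside

c(outside = n_outside, inside = n_inside)
```

With xlim() only the inside rows reach the histogram, whereas with coord_cartesian() all rows are binned and the view is merely cropped.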
In geom_histogram, what is the difference between binwidth and bins? When might you prefer one to another?
binwidth sets the width of each bin in the units of the x variable, while bins sets the total number of bins and leaves ggplot2 to work out the width from the range of the data. bins is convenient for a quick first look at the overall shape when you do not yet know the scale of the variable. binwidth is preferable when the width has a meaning in the data's units (for example 0.01 carat) or when histograms of different data sets need to be comparable.
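The difference between the two arguments can be seen by inspecting the bars ggplot2 actually computes. The sketch below uses `layer_data()` to extract them; the object names are illustrative, and the width that ggplot2 derives for bins = 30 is printed rather than stated, since it depends on the data range and on ggplot2's binning rules.

```r
library(ggplot2)  # provides the diamonds data set

p_bins  <- ggplot(diamonds) + geom_histogram(aes(x = carat), bins = 30)
p_width <- ggplot(diamonds) + geom_histogram(aes(x = carat), binwidth = 0.1)

d_bins  <- layer_data(p_bins, 1)    # one row per bar
d_width <- layer_data(p_width, 1)

nrow(d_bins)                                    # number of bars when bins = 30
unique(round(d_bins$xmax - d_bins$xmin, 4))     # width derived from the data range
nrow(d_width)                                   # number of bars when binwidth = 0.1
unique(round(d_width$xmax - d_width$xmin, 4))   # width as requested
```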