The objective of this workshop is to introduce you to the art of Exploratory Data Analysis (EDA).
The introduction to section 7.1 in R4DS gives a short and useful overview of what EDA is.
In this project we will be working with the diamonds data set. In the console, type ?diamonds to open a help file describing the diamonds data set and its variables.
This is an RMarkdown document. RMarkdown is a package for literate coding. Literate coding is a way of programming in which the program instructions are interwoven with the documentation of the program.
In data science this means that we can produce one document which contains all our analytic steps, in such a way that another reader can read what we have written and also process the same data using the same software. This is a key requirement for reproducible research.
Moreover, combined with a version control system such as Git (with hosting on GitHub), an RMarkdown document can be collaborative. We will talk more about that in the coming weeks.
Today’s workshop is focused on giving you an opportunity to use some of the skills you have worked on developing since the last class. As you work through this document, you should type in your responses to the questions and run your code in the provided code blocks.
Compute the mean (arithmetic average) of the numbers from 1 to 100. (Enter your answer in the block below and run the block by clicking on the little green triangle in the upper right corner of the block.)
mean(1:100) # <- you type this
## [1] 50.5
If necessary, install the tidyverse.
In the console, enter View(diamonds).
In the console, type ?diamonds. This will open a help page describing diamonds. Read the help page and compare its contents with the data you see in the View pane.
Create a “data dictionary” in which you list each variable and its definition.
price: price in US dollars ($326-$18,823)
carat: weight of the diamond (0.2-5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond color, from D (best) to J (worst)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0-10.74)
y: width in mm (0-58.9)
z: depth in mm (0-31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
table: width of top of diamond relative to widest point (43-95)
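The depth formula in the dictionary can be checked against the data. The following is a minimal sketch, assuming the ggplot2 and dplyr packages are installed; the column names depth_calc and depth_diff are hypothetical, chosen for illustration. Note that in R the call mean(x, y) does not average two columns (its second argument is a trim fraction), so the sketch uses the equivalent 2 * z / (x + y) form, multiplied by 100 to express it as a percentage.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Recompute depth from x, y and z and compare it with the recorded column.
# Rows where x + y is 0 are dropped because the ratio is undefined there.
depth_check <- diamonds %>%
  filter(x + y > 0) %>%
  mutate(
    depth_calc = 100 * 2 * z / (x + y),    # hypothetical column: recomputed depth (%)
    depth_diff = abs(depth - depth_calc)   # hypothetical column: absolute discrepancy
  )

# How far the recomputed values are from the recorded ones
summary(depth_check$depth_diff)
```

Small differences are to be expected from rounding in the recorded measurements; large ones would point to rows worth inspecting.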
Summarize diamonds using the summary function.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
The summary of a categorical variable such as “cut” lists the count in each level and no quartiles, while the summary of a quantitative variable lists the minimum, the 1st and 3rd quartiles, the mean, the median, and the maximum, none of which can be computed for a categorical variable.
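The contrast is easy to see by summarizing one variable of each kind on its own. This is a small sketch, assuming ggplot2 is installed; the object names cut_summary and carat_summary are only illustrative.

```r
library(ggplot2)  # provides the diamonds data set

cut_summary   <- summary(diamonds$cut)    # factor: one count per level
carat_summary <- summary(diamonds$carat)  # numeric: min, quartiles, median, mean, max

cut_summary
carat_summary
```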
Make a bar chart of color.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color))
Using the dplyr function count, produce a frequency table for color in the code chunk below.
diamonds %>%
count(color)
## # A tibble: 7 x 2
## color n
## <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
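A frequency table is often easier to compare across groups when the counts are also expressed as proportions. The sketch below assumes ggplot2 and dplyr are installed; the names color_freq and prop are hypothetical.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

color_freq <- diamonds %>%
  count(color) %>%
  mutate(prop = n / sum(n))  # share of all diamonds in each color grade

color_freq
```

The n column holds the same values that geom_bar() draws as bar heights for color.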
Make a histogram of carat from the diamonds data set.
ggplot(diamonds, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 1)  # the argument is binwidth, with no underscore
Try binwidths of 0.01, 0.05, 0.1, 0.25, and 0.5. What do you observe about the resulting histograms?
As the binwidth gets bigger, the bars in the chart get wider and you can see less detail. With a smaller binwidth the shape of the data is easier to see because individual peaks are visible. However, changing the binwidth alters the appearance of the chart, so a poorly chosen value could lead a reader to misinterpret the data.
ggplot(diamonds, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)  # the argument is binwidth, with no underscore
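To compare all five binwidths without retyping the plot, the histograms can be built in a loop. This is only a sketch, assuming ggplot2 is installed and that the argument is spelled binwidth; the list name carat_hists is hypothetical.

```r
library(ggplot2)  # provides the diamonds data set

binwidths <- c(0.01, 0.05, 0.1, 0.25, 0.5)

# Build one histogram per binwidth and keep them in a named list.
carat_hists <- lapply(binwidths, function(bw) {
  ggplot(diamonds, aes(x = carat)) +
    geom_histogram(binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
})
names(carat_hists) <- binwidths

# Print one of them; the others are retrieved the same way.
carat_hists[["0.1"]]
```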
Using cut_width(), make a table of carat frequencies with binwidth = 0.75, and compare your table with the corresponding histogram.
diamonds %>%
count(cut_width(carat, 0.75))
## # A tibble: 8 x 2
## `cut_width(carat, 0.75)` n
## <fct> <int>
## 1 [-0.375,0.375] 12024
## 2 (0.375,1.12] 30983
## 3 (1.12,1.88] 8722
## 4 (1.88,2.62] 2147
## 5 (2.62,3.38] 53
## 6 (3.38,4.12] 8
## 7 (4.12,4.88] 2
## 8 (4.88,5.62] 1
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.75)
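One way to make the comparison concrete is to pull the bar heights out of the histogram and set them beside the table. The sketch below assumes ggplot2 and dplyr are installed; layer_data() is a ggplot2 helper that returns the values computed for a layer, and the names carat_table, carat_hist and hist_bars are hypothetical.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Table of counts per 0.75-wide interval, as in the exercise.
carat_table <- diamonds %>%
  count(bin = cut_width(carat, 0.75))

# The same binwidth as a histogram; layer_data() exposes the computed bars.
carat_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.75)
hist_bars <- layer_data(carat_hist)[, c("xmin", "xmax", "count")]

carat_table
hist_bars
```

If both use the same default bin placement, the interval edges and counts should line up row by row; any mismatch is worth investigating rather than assuming.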
carat2 <- diamonds %>%
filter(carat < 2)
ggplot(data = carat2, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
Use geom_freqpoly() to produce overlaid frequency polygons with binwidth = 0.1 for each color. What happens if you set x = price, y = ..density.. in the aes for geom_freqpoly()?
diamonds %>%
ggplot() +
geom_freqpoly(aes(x = price, y = ..density.., color = color), binwidth = 0.1)
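Plotting the count and density versions side by side shows what the density mapping changes. This is a sketch under assumptions: ggplot2 3.3.0 or later is installed (after_stat(density) is the newer spelling of ..density..), and binwidth = 500 is an arbitrary choice made here because 0.1 is extremely fine for a variable measured in dollars. The names p_count and p_density are hypothetical.

```r
library(ggplot2)  # provides the diamonds data set

# Counts: colors with more diamonds sit higher, so shapes are hard to compare.
p_count <- ggplot(diamonds, aes(x = price, color = color)) +
  geom_freqpoly(binwidth = 500)

# Density: each color's polygon is rescaled so that its area is 1,
# which puts the shapes of the distributions on a common footing.
p_density <- ggplot(diamonds, aes(x = price, y = after_stat(density), color = color)) +
  geom_freqpoly(binwidth = 500)

p_count
p_density
```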
If I didn’t know from the data dictionary that x is length, y is width, and z is depth, I could try to work out which is which from the depth variable. Its definition, total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79), treats x and y symmetrically and z differently, so it seems most likely that z is depth and x and y are length and width. If I knew more about the size of diamonds, I could also run summary(diamonds) and see whether the ranges stood out to me. If you make histograms with the x, y, and z variables on the x-axis in turn with binwidth = 0.5, you see that the distributions differ, and that the y and z histograms look more alike than the x histogram, possibly because the large maxima of y (58.9) and z (31.8) stretch their axes.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = x), binwidth = 0.5)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = z), binwidth = 0.5)
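The three histograms are easier to compare when they share one x-axis. The following sketch assumes ggplot2, dplyr and tidyr are installed; the names dims_long, dimension and mm are hypothetical.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)
library(tidyr)

# Stack x, y and z into one column so they can be faceted.
dims_long <- diamonds %>%
  select(x, y, z) %>%
  pivot_longer(everything(), names_to = "dimension", values_to = "mm")

# One panel per dimension, all drawn on the same x-axis scale.
ggplot(dims_long, aes(x = mm)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~ dimension, ncol = 1)
```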
I started with binwidth = 0.5 but found that I needed to increase the binwidth drastically in order to see the shape of the distribution. This is likely because the price range is so large. One odd thing, however, is that looking at the histogram one would think that the prices range from roughly $100 to over $150,000, yet the help page for the diamonds data set gives the highest price as $18,823. The help page also says that price is measured in US dollars, so it is not a matter of converting cents to dollars. There must therefore be something off with the price variable or with the histogram.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 10)
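One way to settle a question about the range of price is to compute it directly rather than read it off the plot. This is a small sketch, assuming ggplot2 is installed; the name price_range is hypothetical and binwidth = 500 is an arbitrary illustrative choice.

```r
library(ggplot2)  # provides the diamonds data set

# Smallest and largest price in the data
price_range <- range(diamonds$price)
price_range

# A coarser histogram of the same variable
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)
```

The two values returned by range() should agree with the minimum and maximum that summary(diamonds) reports for price.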
Using coord_cartesian() rather than xlim() or ylim() allows viewers to see some odd points that could potentially be outliers (y = ~30 and y = ~60). Zooming in by restricting the x-axis to 0-10 shows that most diamonds have a width in a specific range. Restricting the x-axis further to 0-5 shows that this range starts at about 3.5: with xlim = c(0, 5) only half a bar is visible, which pinpoints where the most frequent widths begin. If you leave binwidth unset, ggplot2 prints a message telling you to pick one.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 1) +
coord_cartesian(ylim = c(0, 50), xlim = c(0, 5))
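The difference between the two approaches is that coord_cartesian() only crops the view, while xlim() and ylim() set scale limits, and values outside those limits are discarded before the bins are computed. The sketch below illustrates this, assuming ggplot2 is installed; the names base_hist, zoomed and clipped are hypothetical.

```r
library(ggplot2)  # provides the diamonds data set

base_hist <- ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 1)

# Zoom only: every diamond is still binned, the view is just cropped.
zoomed <- base_hist + coord_cartesian(xlim = c(0, 10), ylim = c(0, 50))

# Scale limits: values of y outside 0-10 are removed before binning,
# and ggplot2 warns about the removed rows.
clipped <- base_hist + xlim(0, 10)

zoomed
clipped
```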
In geom_histogram, what is the difference between binwidth and bins? When might you prefer one to another?
In a histogram, the bins are the equal-width intervals on the x-axis into which observations are sorted, and the height of each bar shows how many observations are in that bin. The bins argument sets how many of these intervals there are. By varying the number of bins, you change the height of the bars and can potentially play up or play down certain parts of the data set. For example, a histogram with the “y” variable on the x-axis looks a lot different when bins = 5 compared to bins = 10. Once you start changing the number of bins, tracking the binwidth can be difficult because it may not come out as a whole number.
The binwidth argument sets the width of the intervals directly. Changing the binwidth from 0.5 to 1 to 10 in a histogram where x = price made the graph easier to read because the larger binwidth made the height of the bars comparable to the count on the y-axis. As with bins, changing the binwidth changes the way the graph looks and can play up or play down certain parts of the data set. Changing the binwidth, the number of bins, or both is helpful if you’re trying to spot peaks. If your data set is small, varying binwidth may give you more options than varying bins.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), bins = 5)
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), bins = 10)
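To see how the two arguments relate, the width that a given number of bins implies can be read back from the plot. This is a sketch, assuming ggplot2 is installed; implied_width() is a hypothetical helper written for this illustration, not a ggplot2 function.

```r
library(ggplot2)  # provides the diamonds data set

# Hypothetical helper: build a histogram of y with n_bins bins and
# report the width of its bars as computed by ggplot2.
implied_width <- function(n_bins) {
  p <- ggplot(diamonds, aes(x = y)) +
    geom_histogram(bins = n_bins)
  bars <- layer_data(p)
  max(bars$xmax - bars$xmin)
}

implied_width(5)
implied_width(10)
```

Fewer bins over the same range means wider bars, which is why setting bins makes the resulting binwidth depend on the range of the data, while setting binwidth fixes it regardless of the range.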