The objective of this workshop is to introduce you to the art of Exploratory Data Analysis (EDA).
The introduction to section 7.1 in R4DS gives a short and useful overview of what EDA is.
In this project we will be working with the diamonds data set. In the console, type ?diamonds to open a help file describing the diamonds data set and its variables.
This is an RMarkdown document. RMarkdown is a package for literate coding. Literate coding is a way of programming in which the program instructions are interwoven with the documentation of the program.
In data science, this means that we can produce one document which contains all our analytic steps, in such a way that another reader can read what you have written and also process the same data using the same software. This is a key requirement for reproducible research.
Moreover, combined with a version control system such as Git (hosted on a service like GitHub), an RMarkdown document can be collaborative. We will talk more about that in the coming weeks.
Today’s workshop is focused on giving you an opportunity to use some of the skills you have been developing since the last class. As you work through this document, you should type in your responses to the questions and run your code in the provided code blocks.
Compute the mean (arithmetic average) of the numbers from 1 to 100. (Enter your answer in the block below and run the block by clicking on the little green triangle in the upper right corner of the block.)
mean(1:100) # <- you type this
## [1] 50.5
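As a quick cross-check, the mean of the integers 1 through n has the closed form (n + 1) / 2. The sketch below compares that shortcut with mean(); the variable names are illustrative only.

```r
n <- 100

closed_form <- (n + 1) / 2        # arithmetic-series shortcut for the mean of 1..n
computed    <- mean(seq_len(n))   # the same quantity computed directly

# TRUE when the two values agree
all.equal(closed_form, computed)
```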
If necessary install the tidyverse.
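One way to find out whether installation is necessary is to test for each package before loading it. This is a minimal sketch: the package list is an assumption based on the functions used later in this document (theme_solarized() comes from ggthemes, grid.arrange() from gridExtra), and the install call is left commented out so that nothing is downloaded by accident.

```r
# Packages this workshop appears to rely on (an assumption, adjust as needed)
needed <- c("tidyverse", "ggthemes", "gridExtra")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
}
```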
In the console, enter View(diamonds).
In the console, type ?diamonds. This will open a help page describing diamonds. Read the help page and compare its contents with the data you see in the View pane.
Create a “data dictionary” in which you list each variable and its definition.
price - price of the diamond in US dollars
carat - weight of the diamond (the carat is its own unit of weight)
cut - quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color - diamond color quality, from J (worst) to D (best)
clarity - a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x - length of the diamond in mm
y - width of the diamond in mm
z - depth (height) of the diamond in mm
depth - total depth percentage, z / mean(x, y) = 2 * z / (x + y)
table - width of the top of the diamond relative to its widest point
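The definition of depth can be checked against the data. The sketch below recomputes the depth percentage from x, y, and z and summarizes how far it is from the recorded depth column. It assumes ggplot2 is installed; the names ok and depth_check are made up for this example, and rows where x + y is 0 are skipped to avoid dividing by zero.

```r
library(ggplot2)  # provides the diamonds data set

# Keep only rows where the denominator is positive
ok <- (diamonds$x + diamonds$y) > 0

# depth (%) = 100 * z / mean(x, y) = 100 * 2 * z / (x + y)
depth_check <- 100 * 2 * diamonds$z[ok] / (diamonds$x[ok] + diamonds$y[ok])

# Distribution of the difference between recorded and recomputed depth
summary(diamonds$depth[ok] - depth_check)
```

Differences close to 0 for most rows would support the definition; large differences point to rows worth inspecting.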
Summarize diamonds using the summary function.
summary(diamonds)
##      carat               cut        color        clarity          depth
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
##                                     J: 2808   (Other): 2531
##      table           price             x                y
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
##
##        z
##  Min.   : 0.000
##  1st Qu.: 2.910
##  Median : 3.530
##  Mean   : 3.539
##  3rd Qu.: 4.040
##  Max.   :31.800
##
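For quantitative variables, summary() reports the five-number summary plus the mean. For categorical variables, it reports only the count in each category, and when a variable has many categories the least frequent ones are lumped into "(Other)", as happens with clarity here.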
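Summarizing one column at a time gives a little more control than summarizing the whole data frame. This is a small sketch, assuming ggplot2 is installed; clarity_counts and carat_five are names chosen for the example.

```r
library(ggplot2)  # provides the diamonds data set

# summary() on a single factor lists every level,
# so nothing is lumped into "(Other)"
clarity_counts <- summary(diamonds$clarity)
clarity_counts

# fivenum() returns Tukey's five-number summary:
# minimum, lower hinge, median, upper hinge, maximum
carat_five <- fivenum(diamonds$carat)
carat_five
```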
Make a bar chart of color.
library(tidyverse)  # ggplot2, dplyr, and the diamonds data set
library(ggthemes)   # theme_solarized()

p1 <- ggplot(diamonds, aes(x = color, fill = color)) +
  geom_bar() +
  labs(title = "Number of Diamonds per Color Quality",
       x = "Color",
       y = "Count") +
  theme_solarized() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(hjust = 0.50),
        legend.position = "none")
p1
Use the dplyr function count() to produce a frequency table for color in the code chunk below.
freq <- diamonds %>%
  count(color)
freq
## # A tibble: 7 × 2
## color n
## <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
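Counts are easier to compare once they are expressed as proportions. The sketch below extends the frequency table with a prop column; the names freq_prop and prop are assumptions made for this example, and dplyr and ggplot2 are assumed to be installed.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

freq_prop <- diamonds %>%
  count(color) %>%
  mutate(prop = n / sum(n))  # share of all diamonds in each color grade

freq_prop
```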
Make a histogram of carat from the diamonds data set.
p2 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram() +
  labs(title = "Number of Diamonds by Carat Value",
       x = "Carats",
       y = "Count") +
  theme_solarized() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(hjust = 0.50))
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Remake the histogram with binwidth of 0.01, 0.05, 0.1, 0.25, and 0.5. What do you observe about the resulting histograms?
library(grid)       # textGrob(), gpar()
library(gridExtra)  # grid.arrange()

# plot 1
p2 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(alpha = 1.0, binwidth = 0.01) +
  labs(x = "Carats",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.50))
# plot 2
p3 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(alpha = 1.0, binwidth = 0.05) +
  labs(x = "Carats",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.50))
# plot 3
p4 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(alpha = 1.0, binwidth = 0.1) +
  labs(x = "Carats",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.50))
# plot 4
p5 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(alpha = 1.0, binwidth = 0.25) +
  labs(x = "Carats",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.50))
# plot 5
p6 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(alpha = 1.0, binwidth = 0.50) +
  labs(x = "Carats",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.50))
title <- textGrob("Different Bin Width Values of the Same Graph",
                  gp = gpar(fontface = "bold", cex = 1.5))
grid.arrange(p2, p3, p4, p5, p6, nrow = 3, ncol = 2, top = title,
             bottom = "Respectively, the binwidth values are 0.01, 0.05, 0.10, 0.25, and 0.50")
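The five plots above differ only in binwidth, so the repetition can be folded into a small function. This is one possible sketch, not the workshop's prescribed approach: carat_hist() is a hypothetical helper introduced here, and the grid is drawn only if gridExtra happens to be installed.

```r
library(ggplot2)

# Hypothetical helper: one carat histogram for a given binwidth
carat_hist <- function(bw) {
  ggplot(diamonds, aes(x = carat)) +
    geom_histogram(binwidth = bw) +
    labs(title = paste("binwidth =", bw), x = "Carats", y = "Count")
}

binwidths <- c(0.01, 0.05, 0.1, 0.25, 0.5)
plots <- lapply(binwidths, carat_hist)  # list of five ggplot objects

# Draw the plots in a two-column grid when gridExtra is available
if (requireNamespace("gridExtra", quietly = TRUE)) {
  gridExtra::grid.arrange(grobs = plots, ncol = 2)
}
```

Labeling each panel with its own binwidth avoids having to describe the panel order in a caption.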
Use cut_width() to make a table of carat frequencies with binwidth = 0.75, and compare your table with the corresponding histogram.
diamonds %>%
  count(cut_width(carat, 0.75))
## # A tibble: 8 × 2
## `cut_width(carat, 0.75)` n
## <fct> <int>
## 1 [-0.375,0.375] 12024
## 2 (0.375,1.12] 30983
## 3 (1.12,1.88] 8722
## 4 (1.88,2.62] 2147
## 5 (2.62,3.38] 53
## 6 (3.38,4.12] 8
## 7 (4.12,4.88] 2
## 8 (4.88,5.62] 1
# This frequency table looks most similar to the histogram on the bottom right, except that the table has a much higher count of diamonds between -0.375 and 0.375 carats than the first bar of the histogram. The first interval starts below zero because cut_width() centers its first bin on 0 by default. The binwidth here (0.75) is larger than any of the binwidths used in the histograms above, so none of those graphs will look exactly like the data shown in the frequency table.
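To compare the table with a histogram that uses the same bins, the bar heights can be pulled out of the plot itself. The sketch below uses layer_data() from ggplot2 to extract the computed bins; tab, p_075, and hist_counts are names assumed for this example.

```r
library(dplyr)
library(ggplot2)

# Frequency table with 0.75-wide bins
tab <- diamonds %>%
  count(bin = cut_width(carat, 0.75))

# Histogram with the same binwidth
p_075 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.75)

# layer_data() returns one row per bar, including its edges and count
hist_counts <- layer_data(p_075)

tab
hist_counts[, c("xmin", "xmax", "count")]
```

Printing the two side by side shows whether the bin edges and counts line up.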
diamonds1 <- diamonds %>%
  filter(carat < 2)
less_than_2 <- ggplot(diamonds1, aes(x = carat)) +
  geom_histogram(alpha = 1.0, binwidth = 0.1) +
  labs(title = "Number of Diamonds with Carat Values Less Than 2",
       x = "Carats",
       y = "Count") +
  theme_solarized() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(hjust = 0.50))
less_than_2
Use geom_freqpoly() to produce overlaid histograms with binwidth = 0.1 for each color. What happens if you set x = price, y = ..density.. in the aes for geom_freqpoly()?
ggplot(diamonds, aes(x = carat, color = color)) +
  labs(title = "Number of Diamonds per Carat Value (Separated by Color)",
       x = "Carats",
       y = "Count",
       color = "Diamond Color") +
  theme(plot.title = element_text(hjust = 0.50)) +
  geom_freqpoly(binwidth = 0.1)
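The second half of the question can be explored with a sketch like the one below. Mapping y to density rescales each color's polygon so that the area under it is 1, which makes the shapes comparable even though the colors have very different counts. Recent versions of ggplot2 spell ..density.. as after_stat(density); the binwidth of 500 is an assumption chosen to suit the dollar scale of price.

```r
library(ggplot2)

p_density <- ggplot(diamonds,
                    aes(x = price, y = after_stat(density), color = color)) +
  geom_freqpoly(binwidth = 500) +  # 500-dollar bins (an assumed value)
  labs(x = "Price (US dollars)",
       y = "Density",
       color = "Diamond Color")

p_density
```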
summary(diamonds$x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.710 5.700 5.731 6.540 10.740
summary(diamonds$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.720 5.710 5.735 6.540 58.900
summary(diamonds$z)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.910 3.530 3.539 4.040 31.800
# The summary stats show that length (x) and width (y) have almost identical distributions (medians of 5.70 and 5.71 mm, and the same third quartile of 6.54 mm), while depth (z) is smaller throughout (median 3.53 mm). This is what we would expect if most diamonds are roughly as long as they are wide when viewed from above, and shallower than they are wide. The extremes are suspicious: a minimum of 0 mm is impossible for a physical dimension, and the maxima of y (58.9 mm) and z (31.8 mm) are far beyond their third quartiles, while the maximum of x is only 10.74 mm. These values are most likely missing measurements coded as 0 and data-entry errors rather than real diamonds.
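The summaries above report minimum values of 0 and maxima of 58.9 (y) and 31.8 (z), which are worth looking at row by row. This is a minimal sketch for pulling out those rows; the cutoff of 20 mm is an assumption picked only because it sits well above the third quartiles, and unusual is a name made up for the example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

unusual <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 20) %>%  # zero or implausibly large
  select(carat, price, x, y, z) %>%
  arrange(desc(y))

unusual
```

Comparing carat and price with the dimensions in each row helps judge whether a value is a real extreme or a recording error.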
ggplot(diamonds, aes(x = price)) +
  geom_histogram(alpha = 1.0, binwidth = 50) +
  labs(x = "Prices",
       y = "Count") +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(hjust = 0.50))
# Overall, no matter the binwidth, prices are heavily concentrated at the low end of the range: the median is $2,401, while the maximum is $18,823. The size of a diamond usually plays a large role in determining its price, and the shape of this histogram, with a tall peak at low values and a long right tail, closely resembles the distribution of carat seen in the histograms above.
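The suggested link between size and price can be examined directly rather than inferred from two similar-looking histograms. The sketch below computes the correlation between carat and price and builds a scatterplot; carat_price_cor and p_scatter are names assumed for the example, and the alpha value is an arbitrary choice to reduce overplotting.

```r
library(ggplot2)

# Pearson correlation between weight and price
carat_price_cor <- cor(diamonds$carat, diamonds$price)
carat_price_cor

# Scatterplot of the same relationship; transparency reduces overplotting
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  labs(x = "Carats", y = "Price (US dollars)")

p_scatter
```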
In geom_histogram, what is the difference between binwidth and bins? When might you prefer one to another?
binwidth sets the width of each column in the units of the x variable, so the smaller the binwidth, the more columns there are and the more detailed the histogram becomes. bins, on the other hand, sets the number of columns directly, so increasing it also increases the number of columns. bins is convenient for a quick first look when the scale of the variable is not yet known (the default is bins = 30). binwidth is preferable once a meaningful width in the data's units is known, because the bin edges are then interpretable and comparable across plots. If both are supplied, binwidth overrides bins.
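The two arguments can be compared by inspecting the bins that ggplot2 actually computes. This is a sketch under the assumption that ggplot2 is installed; carat_base, by_count, and by_width are illustrative names.

```r
library(ggplot2)

carat_base <- ggplot(diamonds, aes(x = carat))

by_count <- carat_base + geom_histogram(bins = 30)       # fix the number of bins
by_width <- carat_base + geom_histogram(binwidth = 0.1)  # fix the width of each bin

bins_count <- layer_data(by_count)
bins_width <- layer_data(by_width)

# Number of bars produced by each specification
nrow(bins_count)
nrow(bins_width)

# Width of the bars in each plot, in carats
unique(round(bins_count$xmax - bins_count$xmin, 4))
unique(round(bins_width$xmax - bins_width$xmin, 4))
```

With bins the width is whatever the data range dictates, while with binwidth the width is fixed and the number of bars follows from the range.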