Introduction

About this document.

This is an RMarkdown document. RMarkdown is a package for literate coding. Literate coding is a way of programming in which the program instructions are interwoven with the documentation of the program.

In data science this means that we can produce one document which contains all our analytic steps, in such a way that another reader can read what we have written and also process the same data using the same software. This is a key requirement for reproducible research.

Moreover, combined with a version control system such as Git (hosted, for example, on GitHub), an RMarkdown document can be collaborative. We will talk more about that in the coming weeks.

Today’s workshop is focused on giving you an opportunity to use some of the skills you have been developing since the last class. As you work through this document, you should type in your responses to the questions and run your code in the provided code blocks.

Here is an example:

Compute the mean (arithmetic average) of the numbers from 1 to 100. (Enter your answer in the block below and run the block by clicking on the little green triangle in the upper right corner of the block.)

mean(1:100) # <- you type this
## [1] 50.5
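
As a quick sanity check on the example above, the mean can also be computed by hand as the sum divided by the count. For the integers 1 to n this simplifies to (n + 1) / 2. The variable names below are illustrative only.

```r
x <- 1:100

manual_mean <- sum(x) / length(x)  # sum divided by count
closed_form <- (1 + 100) / 2       # (n + 1) / 2 for the integers 1..n

# All three values should agree with mean(1:100).
c(mean(x), manual_mean, closed_form)
```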

Getting started:

  1. If necessary, install the tidyverse with install.packages("tidyverse").

  2. In the console enter View(diamonds)

  3. In the console type ?diamonds. This will open a help page describing diamonds. Read the help page and compare its contents with the data you see in the View pane.

  4. Create a “data dictionary” in which you list each variable and its definition.

price - price of the diamond in US dollars
carat - weight of the diamond (in carats, a unit of mass equal to 0.2 g)
cut - the cut quality of the diamond (Fair, Good, Very Good, Premium, Ideal)
color - diamond color quality, from D (best) to J (worst)
clarity - a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x - length of the diamond in mm
y - width of the diamond in mm
z - depth (height) of the diamond in mm
depth - total depth percentage, z / mean(x, y) = 2 * z / (x + y)
table - width of the top of the diamond relative to its widest point
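
The depth definition can be checked against the data. The sketch below assumes ggplot2 and dplyr are installed, recomputes the percentage from x, y and z, and compares it with the recorded depth column. The names depth_check, depth_calc and abs_diff are hypothetical, and rows where x + y is zero are dropped to avoid dividing by zero.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

depth_check <- diamonds %>%
  filter(x + y > 0) %>%
  mutate(depth_calc = 100 * 2 * z / (x + y),   # z / mean(x, y), as a percentage
         abs_diff   = abs(depth_calc - depth))

# A small typical difference suggests the formula matches the recorded column.
summary(depth_check$abs_diff)
```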

  1. In the code chunk below, create a summary of diamonds using the summary function.
library(tidyverse)  # loads ggplot2 (which provides diamonds) and dplyr
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 
  1. How does the summary of a categorical variable differ from the summary of a quantitative variable?

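For a quantitative variable, summary() reports the five-number summary (minimum, first quartile, median, third quartile, maximum) along with the mean. For a categorical variable, it reports only the count of observations in each category; when a variable has many categories, as clarity does, the least frequent ones are pooled into (Other).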
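
The difference comes from the class of each column, since summary() behaves differently for numeric vectors and factors. A minimal base R sketch with made-up vectors (weights and grades are hypothetical and not taken from diamonds):

```r
weights <- c(0.23, 0.31, 0.70, 1.01, 1.50)                       # quantitative
grades  <- factor(c("Good", "Ideal", "Ideal", "Fair", "Ideal"))  # categorical

summary(weights)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
summary(grades)   # one count per level
```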

  1. In the code chunk below, create a bar chart visualization of color.
library(ggthemes)  # provides theme_solarized()
p1 <- ggplot(diamonds, aes(x = color, fill = color)) +
  geom_bar() +
  labs(title = "Number of Diamonds per Color Quality", 
       x = "Color",
       y = "Count") +
  theme_solarized() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(hjust = 0.50),
        legend.position = "none")

p1

  1. Using the dplyr function count, produce a frequency table for color in the code chunk below.
freq <- diamonds %>%
  count(color)

freq
## # A tibble: 7 × 2
##   color     n
##   <ord> <int>
## 1 D      6775
## 2 E      9797
## 3 F      9542
## 4 G     11292
## 5 H      8304
## 6 I      5422
## 7 J      2808
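
A frequency table is often easier to compare when the counts are also expressed as proportions. A minimal sketch, assuming ggplot2 and dplyr are installed; the prop column and the freq_prop name are additions for illustration.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

freq_prop <- diamonds %>%
  count(color) %>%
  mutate(prop = n / sum(n))  # share of all diamonds in each color grade

freq_prop
```

The n column holds the same values that geom_bar() draws as bar heights, so this table and a bar chart of color carry the same information.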
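  1. For examining the variability of a continuous numerical variable, the first choice is frequently the histogram. A histogram resembles a bar chart, with an important difference: the x-axis is divided into equal-width bins, and the height of each bar is the number of observations that fall into that bin.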
p2 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram() +
  labs(title = "Number of Diamonds by Carat Value", 
       x = "Carats",
       y = "Count") +
  theme_solarized() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(hjust = 0.50))

p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(grid)       # textGrob(), gpar()
library(gridExtra)  # grid.arrange()

# Build the carat histogram for a given bin width.
carat_hist <- function(binwidth) {
  ggplot(diamonds, aes(x = carat)) +
    geom_histogram(binwidth = binwidth) +
    labs(x = "Carats",
         y = "Count")
}

# One histogram per bin width, from finest to coarsest.
binwidths <- c(0.01, 0.05, 0.10, 0.25, 0.50)
plots <- lapply(binwidths, carat_hist)

title <- textGrob("Different Bin Width Values of the Same Graph", gp = gpar(fontface = "bold", cex = 1.5))

grid.arrange(grobs = plots, nrow = 3, ncol = 2, top = title, bottom = "Respectively, the binwidth values are 0.01, 0.05, 0.10, 0.25, and 0.50")

diamonds %>%
  count(cut_width(carat, 0.75))
## # A tibble: 8 × 2
##   `cut_width(carat, 0.75)`     n
##   <fct>                    <int>
## 1 [-0.375,0.375]           12024
## 2 (0.375,1.12]             30983
## 3 (1.12,1.88]               8722
## 4 (1.88,2.62]               2147
## 5 (2.62,3.38]                 53
## 6 (3.38,4.12]                  8
## 7 (4.12,4.88]                  2
## 8 (4.88,5.62]                  1
# This frequency table looks most similar to the histogram with the widest bins (binwidth = 0.50). However, the first interval of the table, [-0.375, 0.375], holds a much higher count than the first bar of that histogram. Because the bin width of the table (0.75) is larger than any of the histogram bin widths, none of the graphs will look exactly like the data shown in the frequency table.
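
The first interval starts at -0.375 because, when neither center nor boundary is given, cut_width() places the bin edges at half-widths, so the first bin is centered on 0. A hedged sketch of an alternative: the boundary argument of cut_width() shifts the edges, here so that the first bin starts at 0. The names carat_bins and bin are hypothetical.

```r
library(ggplot2)  # provides diamonds and cut_width()
library(dplyr)

# boundary = 0 aligns the bin edges with 0, 0.75, 1.5, ...
# instead of centering the first bin on 0.
carat_bins <- diamonds %>%
  count(bin = cut_width(carat, 0.75, boundary = 0))

carat_bins
```

The total count is unchanged; only the way the same diamonds are split across intervals differs.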
  1. Plot a histogram with a binwidth of 0.1 but only for diamonds with carat < 2.
diamonds1 <- diamonds %>%
  filter(carat < 2)

less_than_2 <- ggplot(diamonds1, aes(x = carat)) +
  geom_histogram(alpha = 1.0, binwidth = 0.1) +
  labs(title = "Number of Diamonds with Carat Values Less Than 2",
       x = "Carats",
       y = "Count") +
  theme_solarized() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(hjust = 0.50))

less_than_2

  1. Read about geom_freqpoly() and produce overlaid histograms with binwidth = 0.1 for each color. What happens if you set x = price, y = ..density.. in the aes for geom_freqpoly()?
ggplot(diamonds, aes(x = carat, color = color)) + 
  labs(title = "Number of Diamonds per Carat Value (Separated by Color)",
       x = "Carats",
       y = "Count",
       color = "Diamond Color") +
  theme(plot.title = element_text(hjust = 0.50)) +
  geom_freqpoly(binwidth = 0.1)
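
The second part of the question can be explored with a sketch like the one below. Mapping y to density rescales each color's polygon to an area of 1, which makes the shapes of the price distributions comparable even though the colors have very different counts. Two assumptions: after_stat(density) is used as the current spelling of ..density.. (which recent ggplot2 versions deprecate), and binwidth = 500 is an arbitrary choice for the price scale, not a value from the exercise.

```r
library(ggplot2)

# y = after_stat(density) normalizes each color group separately.
p_density <- ggplot(diamonds, aes(x = price, y = after_stat(density), color = color)) +
  geom_freqpoly(binwidth = 500) +
  labs(x = "Price (US dollars)",
       y = "Density",
       color = "Diamond Color")

p_density
```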

  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
summary(diamonds$x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.710   5.700   5.731   6.540  10.740
summary(diamonds$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.720   5.710   5.735   6.540  58.900
summary(diamonds$z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.910   3.530   3.539   4.040  31.800
# The summary statistics show that x and y have nearly identical distributions (medians of 5.70 and 5.71 mm), while z is consistently smaller (median of 3.53 mm). This fits the geometry of a cut diamond: seen from above it is roughly as long as it is wide, so the length (x) and width (y) are about equal, and the depth (z) is roughly 60% of that, in line with the mean depth percentage of 61.75 reported by summary(diamonds). Two things look wrong. First, all three variables have a minimum of 0, which is impossible for a physical dimension, so those zeros probably stand for missing measurements. Second, the maximum values of y (58.9 mm) and z (31.8 mm) are far above their third quartiles (6.54 and 4.04 mm) and well above the maximum of x (10.74 mm), so they are more likely data-entry errors than real diamonds.
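
The minimum of 0 and the very large maximum values of y and z in the summaries above are worth a closer look. One way to follow up is to pull out the rows with implausible dimensions and inspect them directly. A minimal sketch, assuming ggplot2 and dplyr are installed; the 20 mm cutoff is an arbitrary threshold chosen for illustration, and unusual is a hypothetical name.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Keep rows where any dimension is zero or where y or z is implausibly large.
unusual <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 20) %>%
  select(price, carat, x, y, z) %>%
  arrange(desc(y))

unusual
```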
  1. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(diamonds, aes(x = price)) +
  geom_histogram(alpha = 1.0, binwidth = 50) +
  labs(x = "Prices",
       y = "Count") +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(hjust = 0.50))

# Overall, no matter the binwidth, prices are heavily concentrated at the low end, around the $1,000 to $2,000 mark, with a long right tail reaching the maximum of $18,823. The size of a diamond usually plays a large role in determining its price, and the price histogram looks very similar to the histograms of carat above: both are strongly right-skewed.
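
Following the hint, it can help to zoom in on the crowded low end of the price range with a much narrower bin width. The sketch below restricts the data to prices under $2,500 and uses binwidth = 10; both numbers are arbitrary choices for illustration, and cheap and p_zoom are hypothetical names.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

cheap <- diamonds %>%
  filter(price < 2500)

# Narrow bins on a restricted range show fine structure in the distribution.
p_zoom <- ggplot(cheap, aes(x = price)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Price (US dollars)",
       y = "Count")

p_zoom
```

Any gaps or spikes that appear at this resolution would be hidden when the bins are wide.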
  1. In geom_histogram what is the difference between binwidth and bins? When might you prefer one to another?

The binwidth argument sets the width of each column in the units of the x variable. Therefore, the smaller the binwidth, the more columns there are and the more detailed the histogram becomes. The bins argument, on the other hand, sets the total number of columns, and the width of each column is then derived from the range of the data; as bins increases, the columns get narrower. bins is convenient for a quick first look when the scale of the variable is not yet familiar. If a meaningful column width is known (for example, 0.1 carat), binwidth is the better choice, because the column boundaries then fall on interpretable values.
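
The two arguments are linked through the range of the data: the number of bins is roughly the range divided by the bin width. A rough base R sketch of that arithmetic for carat, using the minimum and maximum reported by summary(diamonds). The helper names bins_for and width_for are hypothetical, and ggplot2 may add a bin at the edges depending on how the bins are aligned, so the results should be read as approximate.

```r
carat_range <- c(0.20, 5.01)  # Min. and Max. of carat from summary(diamonds)
span <- diff(carat_range)     # 4.81

# Approximate number of bins implied by a given binwidth.
bins_for <- function(binwidth) ceiling(span / binwidth)
bins_for(0.1)  # 49

# Approximate binwidth implied by a given number of bins.
width_for <- function(bins) span / bins
width_for(30)
```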