Introduction

Cognitive psychology literature suggests that self-explanation of expert solutions to worked examples are an efficient method of organizing new information. The goal of this exercise is to develop familiarity with R, RStudio, and RMarkdown Notebooks by reproducing a worked example of basic exploratory data analysis. The exercise is an adaptation of a blog post shared by Prof Nicholas Horton from Amherst College.

As the article suggested, there are several ways to learn (and use) R. For the purposes of our class, it’s useful to learn a model-centric approach to R:

function( Y ~ X, data = DataSetName )

Here’s a short description of each part in the pseudo-code above:

Front-matter

Most likely, you’ll want to load one or more R packages to accomplish your goals. For example, the model-centric approach to R that we’re introducing is based on a group of tools published in an R package called mosaic.

There are two steps when you want to use a package: (1) install the package from the source and (2) call the package for use in your program.

If you think of an R package like a song you buy online. You buy each song one time to add them to your music library. If you want to make a playlist, you need to go to your music library and select the songs you want to hear.

Your R script is like the playlist and packages are the songs. You only need to install an R package once (using the install.packages() function). When you write an R script, you need to add/call the packages you want to include (using the library() function)… You wouldn’t buy a song everytime you want to hear it; once you own it, you just add it to the playlist.

Load Packages

We want to use the following packages:

  • mosiac
  • mosaicData
  • tidyverse

If you’re using the PSU RStudio Server, the packages have already been installed on the server, so you can just call them with the library() command. If you are using another implementation of R, you need to install the packages for the first time before using with install.packages()

# Note: the hashtag character (#) designates user comments... use them *very* generously

# Load Packages 
library(mosaic)   # comments can even be appended to the line of code
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggformula
## Loading required package: ggplot2
## Loading required package: ggstance
## 
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Loading required package: Matrix
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median,
##     prop.test, quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(mosaicData)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## āœ” tibble  2.1.1     āœ” purrr   0.3.2
## āœ” tidyr   0.8.3     āœ” stringr 1.3.1
## āœ” readr   1.3.1     āœ” forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## āœ– mosaic::count() masks dplyr::count()
## āœ– purrr::cross()  masks mosaic::cross()
## āœ– mosaic::do()    masks dplyr::do()
## āœ– tidyr::expand() masks Matrix::expand()
## āœ– dplyr::filter() masks stats::filter()
## āœ– dplyr::lag()    masks stats::lag()
## āœ– mosaic::stat()  masks ggplot2::stat()
## āœ– mosaic::tally() masks dplyr::tally()
library(dplyr)
library(ggformula)
library(ggplot2)

Exploration the of the diamonds data

Complete the RMarkdown Notebook by adding narrative & R code chunks to reproduce & extend the exercise in the article.

Inspecting the data source

Now you’re ready to learn a little bit about the diamonds data set. Edit the bullet list to add a short description in your own words describing what each function does. (Warning, one of these functions only works in the R console and not in RMarkdown. Be sure to mention that in your description for that one) - glimpse() - head() - names() - View()

Basic forms of data inspection

- Glimpse(): This is like a transposed version of print, making it possible to see every column in a data frame.
- Head(): Shows the first several rows of a matrix or data frame
- Names(): Shows the names of the headers
- View():Allows you to view the data (note - it only works in R console and cannot be viewed in R Markdown)
# Inspecting the data source

glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Goo…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1,…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, …
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 3…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.…
head(diamonds)
names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
view(diamonds)

Some Data Prep

The following is a little bit of data wrangling to get the source data in shape for our purposes. You should ignore this part for now, and we can talk about it another time.

# Recode & filter (no edits needed)
recoded <- 
  diamonds %>%
  filter(color=="D" | color=="J") %>%
  mutate(col = as.character(color))

Basically, we’re going to do a bit of exploration of variables that impact cost of diamonds. Even if you haven’t used R before, you might be able to tell from the code that we start with the diamonds data, filtered (i.e.Ā restricted) our data set to only include the diamonds that are either color ā€œDā€ or ā€œJā€, created a new categorical variable called ā€œcolā€, and stored the whole thing in a new data set called ā€œrecodedā€.

Statisticians should always know something about the data domain in order to be useful, if you don’t know anything about the subject area you need to at least learn some basics. Wikipedia is usually a good place to start: https://en.wikipedia.org/wiki/Diamond_color.

Exploratory Data Analysis

Always start with clear research questions. Our question for this exercise:

  1. How do diamond prices compare for ā€œDā€ and ā€œJā€ colored diamonds?

The purpose of the exploratory data analysis (EDA) is to learn as much as you can about your research question before doing any fancy statistical modeling. We basically want to try and answer the research question with EDA if possible, and then we use statistical models to formally accommodate variability in the data and calculate the uncertainty of our conclusions.

Mean price by color

Use the R code chunk to calculate the mean price by color just like the article. Summarize your observations below the code chunk. Don’t forget, we did some data wrangling above and made a new data set called ā€œrecoded.ā€ Use the ā€œrecodedā€ data for the rest of this analysis.

mean(price ~ col, data = recoded)
##        D        J 
## 3169.954 5323.818
It appears that J diamonds, on average, are more expensive than the D Diamonds

Other summary statistics by color

Use the R code chunk to calculate the other summary statistics for price of each diamond color using favstats() just like the article. Summarize your observations below the code chunk.

favstats(price ~ col, data = recoded)
Based on the summary statistics, it appears that while the J diamonds are more expensive than the D diamonds, the J diamonds also have much more varibility within their price.

,

2-3 Basic plots of the data

Make side-by-side boxplots using the R code shown in the article, and then use the mplot() function to generate 1-2 additional plots that show the data in a different way. Warning: mplot() is one of those functions that only works in the console… use mplot() to explore the data, and then select the gear icon in the upper left, choose ā€œGraphics System: gg_formulaā€, and click ā€œShow Expressionā€. This prints some code in the console that you can copy & paste into your RMarkdown document.

gf_boxplot(price ~ col, data = recoded)

gf_histogram(~ carat, data = recoded, binwidth = 0.19) %>%
gf_labs(title = "Number of Carats")

gf_violin(carat ~ ntiles(depth), data = recoded) %>% 
gf_labs(title = "Carat vs Depth", caption = "")

gf_boxplot(y ~ cut, data = recoded) %>% 
gf_labs(title = "", caption = "")

Basic statistical modeling

Hopefully you feel you have a sense of the answer to your research question. Now we use statisical models to formalize the results.

Two-sample t-test

We’re trying to compare two group means, so a two-sample t-test is a useful tool for that purpose. Reproduce the two-sample t-test from the article, and then write a conclusion based on the t-test below the code chunk.

t.test(price ~ col, data = recoded)
## 
##  Welch Two Sample t-test
## 
## data:  price by col
## t = -23.121, df = 4197.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2336.496 -1971.232
## sample estimates:
## mean in group D mean in group J 
##        3169.954        5323.818

[conclusion from the t-test here…]

Based on the data we see that there is a statistically significant difference between to the two groups and could conclude that J diamonds tend to cost more than D diamonds.

Linear model

You could arrive at the same conclusion using a linear model. In fact, the t-test is a special case of a more general linear model, so that’s why the result is the same. We’ll talk more about that another time. Here,

msummary(lm(price ~ col, data = recoded))
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3169.95      45.03   70.40   <2e-16 ***
## colJ         2153.86      83.18   25.89   <2e-16 ***
## 
## Residual standard error: 3706 on 9581 degrees of freedom
## Multiple R-squared:  0.0654, Adjusted R-squared:  0.0653 
## F-statistic: 670.4 on 1 and 9581 DF,  p-value: < 2.2e-16

Multivariable relationships

The world is often too complicated to be understood by studying one or two variables at a time. Color is certainly not the only variable that impacts the value of a diamond. You may have noticed in the wikipedia article that color is only one of the ā€œ4 C’sā€ that influence the value of a diamond.

Adjusing for other variables

Create a scatter plot of price vs carat (a measure of diamond weight) that colors each plotted point according the color of the diamond it represents. Write a few sentances below the code chunk to explain what you’ve observed from this plot.

gf_point(price ~ carat, colour = ~ col, data = recoded)

We can see from the scatter plot that there are far more extreme values in the variable of number of carats. This accounts for the variability of the price in the J diamonds and the difference in price between the two groups. Looking at this scatter plot it appears that D diamonds are worth more per carat, but there are far more examples of J diamonds with 2 or more carats. This may explain why the J diamonds, on average, cost more than the D diamonds.

Lastly, use mplot() to explore the recode data further. Add a model (try linear & smooth), play with plot types, facets, key, scales, etc. There is absolutely no risk that you’ll ā€œbreakā€ something. You’ll make some good plots that way, but don’t be surprised if you go through a lot of ugly, senseless plots along the way…

gf_point(price ~ carat, data = recoded) %>%
gf_lm() %>% 
gf_labs(title = "Price by Carat", caption = "Linear Model")

gf_point(price ~ carat, data = recoded, color = ~ cut) %>%
gf_smooth() %>% 
gf_theme(legend.position = "right") %>% 
gf_labs(title = "Price per Carat & Cut", caption = "Smooth Model")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Price adjusted for carat weight and color

The last model in the article is a linear model for diamond price based on carat weight and color (i.e.Ā D vs J). Notice that we fit the model, store it in an object that we call mod, and then do the summary as a separate step. Can you see why R interprets this in the same way as the nested syntax from the previous use of msummary(...)?

msummary(recoded)
##      carat               cut       color       clarity         depth      
##  Min.   :0.2000   Fair     : 282   D:6775   SI1    :2833   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 969   E:   0   VS2    :2428   1st Qu.:61.00  
##  Median :0.7000   Very Good:2191   F:   0   SI2    :1849   Median :61.90  
##  Mean   :0.8056   Premium  :2411   G:   0   VS1    :1247   Mean   :61.75  
##  3rd Qu.:1.0300   Ideal    :3730   H:   0   VVS2   : 684   3rd Qu.:62.60  
##  Max.   :5.0100                    I:   0   VVS1   : 326   Max.   :73.60  
##                                    J:2808   (Other): 216                  
##      table           price               x               y         
##  Min.   :51.60   Min.   :  335.0   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  976.5   1st Qu.: 4.73   1st Qu.: 4.740  
##  Median :57.00   Median : 2310.0   Median : 5.67   Median : 5.680  
##  Mean   :57.52   Mean   : 3801.1   Mean   : 5.74   Mean   : 5.743  
##  3rd Qu.:59.00   3rd Qu.: 5084.5   3rd Qu.: 6.53   3rd Qu.: 6.530  
##  Max.   :73.00   Max.   :18710.0   Max.   :10.74   Max.   :10.540  
##                                                                    
##        z             col           
##  Min.   :0.000   Length:9583       
##  1st Qu.:2.930   Class :character  
##  Median :3.500   Mode  :character  
##  Mean   :3.545                     
##  3rd Qu.:4.030                     
##  Max.   :6.980                     
## 
# fit linear model
mod <- lm(price ~ carat + col, data = recoded)
msummary(mod)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1900.95      28.36  -67.03   <2e-16 ***
## carat        7708.95      33.68  228.88   <2e-16 ***
## colJ        -1734.08      36.86  -47.05   <2e-16 ***
## 
## Residual standard error: 1457 on 9580 degrees of freedom
## Multiple R-squared:  0.8555, Adjusted R-squared:  0.8555 
## F-statistic: 2.836e+04 on 2 and 9580 DF,  p-value: < 2.2e-16
# Plot the regression model
plotModel(mod, system = "ggplot2")  

Conclusions

What was your research question again? Restate it here and then write a paragraph or two abour your observations conclusions after studying these data.

Original Research Question:

How do diamond prices compare for ā€œDā€ and ā€œJā€ colored diamonds?

Summary of Conclusions:

When taking a surface look at the data by investigating means, we see that there is a significant difference between J diamonds and D diamonds, with the former being more expensive on average. However, we can also see that there is a much larger spread in he data of the J diamonds than that of the D diamonds. When we take a closer look at the data, we can see that the quality of the D diamonds are generally higher when we look at price per carat and instances of premium and ideal cuts. The difference between the average price in the two groups coes primarily through the number of instances of diamonds within the J group havig 2 or more carats. So, we can also conclude that J diamonds are more likely to have more carats than D diamonds, and consequently more expensive, but D diamonds are typically higher quality per carat.