Cognitive psychology literature suggests that self-explanation of expert solutions to worked examples are an efficient method of organizing new information. The goal of this exercise is to develop familiarity with R, RStudio, and RMarkdown Notebooks by reproducing a worked example of basic exploratory data analysis. The exercise is an adaptation of a blog post shared by Prof Nicholas Horton from Amherst College.
As the article suggested, there are several ways to learn (and use) R. For the purposes of our class, itās useful to learn a model-centric approach to R:
function( Y ~ X, data = DataSetName )
Hereās a short description of each part in the pseudo-code above:
function is an R function that dictates something you want to do with your data e.g.:
mean calculates the meant.test performs a two sample t-testlm fits a linear regression modelY is the outcome of interest (response variable)X is some explanatory variable or you can use ā1ā as a placeholder if there is no explanatory variableDataSetName is the name of a data set loaded into the R environmentMost likely, youāll want to load one or more R packages to accomplish your goals. For example, the model-centric approach to R that weāre introducing is based on a group of tools published in an R package called mosaic.
There are two steps when you want to use a package: (1) install the package from the source and (2) call the package for use in your program.
If you think of an R package like a song you buy online. You buy each song one time to add them to your music library. If you want to make a playlist, you need to go to your music library and select the songs you want to hear.
Your R script is like the playlist and packages are the songs. You only need to install an R package once (using the install.packages() function). When you write an R script, you need to add/call the packages you want to include (using the library() function)⦠You wouldnāt buy a song everytime you want to hear it; once you own it, you just add it to the playlist.
We want to use the following packages:
mosiacmosaicDatatidyverseIf youāre using the PSU RStudio Server, the packages have already been installed on the server, so you can just call them with the library() command. If you are using another implementation of R, you need to install the packages for the first time before using with install.packages()
# Note: the hashtag character (#) designates user comments... use them *very* generously
# Load Packages
library(mosaic) # comments can even be appended to the line of code
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggformula
## Loading required package: ggplot2
## Loading required package: ggstance
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Loading required package: Matrix
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following object is masked from 'package:ggplot2':
##
## stat
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median,
## prop.test, quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
library(mosaicData)
library(tidyverse)
## āā Attaching packages āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā tidyverse 1.2.1 āā
## ā tibble 2.1.1 ā purrr 0.3.2
## ā tidyr 0.8.3 ā stringr 1.3.1
## ā readr 1.3.1 ā forcats 0.4.0
## āā Conflicts āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā tidyverse_conflicts() āā
## ā mosaic::count() masks dplyr::count()
## ā purrr::cross() masks mosaic::cross()
## ā mosaic::do() masks dplyr::do()
## ā tidyr::expand() masks Matrix::expand()
## ā dplyr::filter() masks stats::filter()
## ā dplyr::lag() masks stats::lag()
## ā mosaic::stat() masks ggplot2::stat()
## ā mosaic::tally() masks dplyr::tally()
library(dplyr)
library(ggformula)
library(ggplot2)
diamonds dataComplete the RMarkdown Notebook by adding narrative & R code chunks to reproduce & extend the exercise in the article.
Now youāre ready to learn a little bit about the diamonds data set. Edit the bullet list to add a short description in your own words describing what each function does. (Warning, one of these functions only works in the R console and not in RMarkdown. Be sure to mention that in your description for that one) - glimpse() - head() - names() - View()
- Glimpse(): This is like a transposed version of print, making it possible to see every column in a data frame.
- Head(): Shows the first several rows of a matrix or data frame
- Names(): Shows the names of the headers
- View():Allows you to view the data (note - it only works in R console and cannot be viewed in R Markdown)
# Inspecting the data source
glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.ā¦
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Gooā¦
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J,ā¦
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1,ā¦
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59ā¦
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, ā¦
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 3ā¦
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.ā¦
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.ā¦
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.ā¦
head(diamonds)
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
view(diamonds)
The following is a little bit of data wrangling to get the source data in shape for our purposes. You should ignore this part for now, and we can talk about it another time.
# Recode & filter (no edits needed)
recoded <-
diamonds %>%
filter(color=="D" | color=="J") %>%
mutate(col = as.character(color))
Basically, weāre going to do a bit of exploration of variables that impact cost of diamonds. Even if you havenāt used R before, you might be able to tell from the code that we start with the diamonds data, filtered (i.e.Ā restricted) our data set to only include the diamonds that are either color āDā or āJā, created a new categorical variable called ācolā, and stored the whole thing in a new data set called ārecodedā.
Statisticians should always know something about the data domain in order to be useful, if you donāt know anything about the subject area you need to at least learn some basics. Wikipedia is usually a good place to start: https://en.wikipedia.org/wiki/Diamond_color.
Always start with clear research questions. Our question for this exercise:
The purpose of the exploratory data analysis (EDA) is to learn as much as you can about your research question before doing any fancy statistical modeling. We basically want to try and answer the research question with EDA if possible, and then we use statistical models to formally accommodate variability in the data and calculate the uncertainty of our conclusions.
Use the R code chunk to calculate the mean price by color just like the article. Summarize your observations below the code chunk. Donāt forget, we did some data wrangling above and made a new data set called ārecoded.ā Use the ārecodedā data for the rest of this analysis.
mean(price ~ col, data = recoded)
## D J
## 3169.954 5323.818
Use the R code chunk to calculate the other summary statistics for price of each diamond color using favstats() just like the article. Summarize your observations below the code chunk.
favstats(price ~ col, data = recoded)
,
Make side-by-side boxplots using the R code shown in the article, and then use the mplot() function to generate 1-2 additional plots that show the data in a different way. Warning: mplot() is one of those functions that only works in the console⦠use mplot() to explore the data, and then select the gear icon in the upper left, choose āGraphics System: gg_formulaā, and click āShow Expressionā. This prints some code in the console that you can copy & paste into your RMarkdown document.
gf_boxplot(price ~ col, data = recoded)
gf_histogram(~ carat, data = recoded, binwidth = 0.19) %>%
gf_labs(title = "Number of Carats")
gf_violin(carat ~ ntiles(depth), data = recoded) %>%
gf_labs(title = "Carat vs Depth", caption = "")
gf_boxplot(y ~ cut, data = recoded) %>%
gf_labs(title = "", caption = "")
Hopefully you feel you have a sense of the answer to your research question. Now we use statisical models to formalize the results.
Weāre trying to compare two group means, so a two-sample t-test is a useful tool for that purpose. Reproduce the two-sample t-test from the article, and then write a conclusion based on the t-test below the code chunk.
t.test(price ~ col, data = recoded)
##
## Welch Two Sample t-test
##
## data: price by col
## t = -23.121, df = 4197.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2336.496 -1971.232
## sample estimates:
## mean in group D mean in group J
## 3169.954 5323.818
[conclusion from the t-test hereā¦]
Based on the data we see that there is a statistically significant difference between to the two groups and could conclude that J diamonds tend to cost more than D diamonds.
You could arrive at the same conclusion using a linear model. In fact, the t-test is a special case of a more general linear model, so thatās why the result is the same. Weāll talk more about that another time. Here,
msummary(lm(price ~ col, data = recoded))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3169.95 45.03 70.40 <2e-16 ***
## colJ 2153.86 83.18 25.89 <2e-16 ***
##
## Residual standard error: 3706 on 9581 degrees of freedom
## Multiple R-squared: 0.0654, Adjusted R-squared: 0.0653
## F-statistic: 670.4 on 1 and 9581 DF, p-value: < 2.2e-16
The world is often too complicated to be understood by studying one or two variables at a time. Color is certainly not the only variable that impacts the value of a diamond. You may have noticed in the wikipedia article that color is only one of the ā4 Cāsā that influence the value of a diamond.
Create a scatter plot of price vs carat (a measure of diamond weight) that colors each plotted point according the color of the diamond it represents. Write a few sentances below the code chunk to explain what youāve observed from this plot.
gf_point(price ~ carat, colour = ~ col, data = recoded)
We can see from the scatter plot that there are far more extreme values in the variable of number of carats. This accounts for the variability of the price in the J diamonds and the difference in price between the two groups. Looking at this scatter plot it appears that D diamonds are worth more per carat, but there are far more examples of J diamonds with 2 or more carats. This may explain why the J diamonds, on average, cost more than the D diamonds.
Lastly, use mplot() to explore the recode data further. Add a model (try linear & smooth), play with plot types, facets, key, scales, etc. There is absolutely no risk that youāll ābreakā something. Youāll make some good plots that way, but donāt be surprised if you go through a lot of ugly, senseless plots along the wayā¦
gf_point(price ~ carat, data = recoded) %>%
gf_lm() %>%
gf_labs(title = "Price by Carat", caption = "Linear Model")
gf_point(price ~ carat, data = recoded, color = ~ cut) %>%
gf_smooth() %>%
gf_theme(legend.position = "right") %>%
gf_labs(title = "Price per Carat & Cut", caption = "Smooth Model")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The last model in the article is a linear model for diamond price based on carat weight and color (i.e.Ā D vs J). Notice that we fit the model, store it in an object that we call mod, and then do the summary as a separate step. Can you see why R interprets this in the same way as the nested syntax from the previous use of msummary(...)?
msummary(recoded)
## carat cut color clarity depth
## Min. :0.2000 Fair : 282 D:6775 SI1 :2833 Min. :43.00
## 1st Qu.:0.4000 Good : 969 E: 0 VS2 :2428 1st Qu.:61.00
## Median :0.7000 Very Good:2191 F: 0 SI2 :1849 Median :61.90
## Mean :0.8056 Premium :2411 G: 0 VS1 :1247 Mean :61.75
## 3rd Qu.:1.0300 Ideal :3730 H: 0 VVS2 : 684 3rd Qu.:62.60
## Max. :5.0100 I: 0 VVS1 : 326 Max. :73.60
## J:2808 (Other): 216
## table price x y
## Min. :51.60 Min. : 335.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 976.5 1st Qu.: 4.73 1st Qu.: 4.740
## Median :57.00 Median : 2310.0 Median : 5.67 Median : 5.680
## Mean :57.52 Mean : 3801.1 Mean : 5.74 Mean : 5.743
## 3rd Qu.:59.00 3rd Qu.: 5084.5 3rd Qu.: 6.53 3rd Qu.: 6.530
## Max. :73.00 Max. :18710.0 Max. :10.74 Max. :10.540
##
## z col
## Min. :0.000 Length:9583
## 1st Qu.:2.930 Class :character
## Median :3.500 Mode :character
## Mean :3.545
## 3rd Qu.:4.030
## Max. :6.980
##
# fit linear model
mod <- lm(price ~ carat + col, data = recoded)
msummary(mod)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1900.95 28.36 -67.03 <2e-16 ***
## carat 7708.95 33.68 228.88 <2e-16 ***
## colJ -1734.08 36.86 -47.05 <2e-16 ***
##
## Residual standard error: 1457 on 9580 degrees of freedom
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8555
## F-statistic: 2.836e+04 on 2 and 9580 DF, p-value: < 2.2e-16
# Plot the regression model
plotModel(mod, system = "ggplot2")
What was your research question again? Restate it here and then write a paragraph or two abour your observations conclusions after studying these data.
Original Research Question:
How do diamond prices compare for āDā and āJā colored diamonds?
Summary of Conclusions:
When taking a surface look at the data by investigating means, we see that there is a significant difference between J diamonds and D diamonds, with the former being more expensive on average. However, we can also see that there is a much larger spread in he data of the J diamonds than that of the D diamonds. When we take a closer look at the data, we can see that the quality of the D diamonds are generally higher when we look at price per carat and instances of premium and ideal cuts. The difference between the average price in the two groups coes primarily through the number of instances of diamonds within the J group havig 2 or more carats. So, we can also conclude that J diamonds are more likely to have more carats than D diamonds, and consequently more expensive, but D diamonds are typically higher quality per carat.