Module 8: Regression and the Line of Best Fit
Workshop 8: Modeling Diamond Pricing
Social science is full of numeric variables, like voter turnout, percentage of votes for party X, income, unemployment rates, and rates of policy implementation or people affected. So how do we analyze the association between two numeric variables?
Today, we’re going to investigate a popular dataset on commerce. The ggplot2 package’s diamonds dataset contains 53,940 diamond sales gathered from the Loose Diamonds Search Engine in 2017. We’re going to examine a random sample of 1,000 of these diamonds, saved as mydiamonds.csv. This dataset lets us investigate a popular question for consumers: Is a diamond’s size, measured by carat, actually related to its cost, measured by price? Let’s investigate using the techniques below.
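For readers curious how a random sample like this could be drawn from the full dataset, here is a minimal sketch. It assumes the dplyr and ggplot2 packages are installed. The seed and the object name diamond_sample are hypothetical, chosen only for illustration, and this is not necessarily how mydiamonds.csv was actually produced.

```r
library(dplyr)    # for select() and slice_sample()
library(ggplot2)  # provides the full diamonds dataset

set.seed(8)  # hypothetical seed, used only to make this sketch reproducible

# Keep the three columns used in this workshop, then draw 1000 random rows
diamond_sample <- diamonds %>%
  select(price, carat, cut) %>%
  slice_sample(n = 1000)

# Check the number of rows and columns in the sample
dim(diamond_sample)
```

Because the rows are drawn at random, a different seed would give a different set of 1,000 diamonds, so results from a sketch like this would not exactly match the workshop file.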
0. Import Data
Load Data
In this dataset, each row is a diamond!
library(tidyverse) # for data wrangling
library(viridis) # for colors
library(broom) # for regression
library(moderndive) # for regression
# Save the diamonds dataset as an object in my environment
mydiamonds <- read_csv("mydiamonds.csv")
View Data
# View first 3 rows of dataset
mydiamonds %>% head(3)
| price | carat | cut |
|---|---|---|
| 4596 | 1.20 | Ideal |
| 1934 | 0.62 | Ideal |
| 4840 | 1.14 | Very Good |
Codebook
In this dataset, our variables mean:
price: price of diamond in US dollars (from $326 to $18,823!)
carat: weight of the diamond (0.2 to 5.01 carats)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
1. Review
We have several tools in our toolkit for measuring the association between two variables: (1) Scatterplots, (2) Correlation, and (3) Regression / Line of Best Fit (New!). Let’s investigate!
Scatterplots
First, we can visualize the relationship between 2 numeric variables using a scatterplot, putting one on the x-axis and one on the y-axis. In a scatterplot, each dot represents a row in our dataset.
So, we can visualize just five randomly selected dots, like this:
mydiamonds %>% # pipe from dataframe
  sample_n(5) %>% # take a random sample
  ggplot(mapping = aes(x = carat, y = price)) +
  # Pro-tip: if you say, shape = 21,
  # this lets us change both the fill and the outline color of the dot
  geom_point(size = 5, shape = 21,
             fill = "steelblue", color = "white") +
  theme_classic(base_size = 30)
Or we can visualize all the dots, like this:
mydiamonds %>% # just pipe directly from data.frame
  ggplot(mapping = aes(x = carat, y = price)) +
  # Pro-tip: if you say, shape = 21,
  # this lets us change both the fill and the outline color of the dot
  geom_point(size = 3, shape = 21,
             fill = "white", color = "steelblue") +
  theme_classic()
We can see that there’s a strong, positive relationship. As carat increases, price increases too!
Correlation
We can measure the relationship between two numeric variables using Pearson’s r, the correlation coefficient! This statistic ranges from -1 to +1. A value of -1 indicates the strongest possible negative relationship, 0 indicates no linear relationship, and +1 indicates the strongest possible positive relationship. So r tells us (1) how strongly the two variables are associated, and (2) whether that association is positive or negative. The animation below shows the full range of possible correlations we might get.
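To make the statistic concrete, here is a minimal sketch that computes Pearson’s r two ways, using only the three diamonds displayed in the View Data table. Three rows are far too few for a meaningful estimate; the point is just to show that base R’s cor() matches the textbook formula. The vector names (r_builtin, r_manual, dx, dy) are chosen for illustration.

```r
# The three diamonds displayed in the View Data table
carat <- c(1.20, 0.62, 1.14)
price <- c(4596, 1934, 4840)

# Option 1: base R's built-in Pearson correlation
r_builtin <- cor(carat, price, method = "pearson")

# Option 2: by hand, as the sum of co-deviations
# divided by the square root of the product of squared deviations
dx <- carat - mean(carat)  # each diamond's distance from the mean carat
dy <- price - mean(price)  # each diamond's distance from the mean price
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Both values should agree, and be positive
r_builtin
r_manual
```

Assuming mydiamonds has been loaded as in the Import Data step, the same calculation on the full sample could be written as mydiamonds %>% summarize(r = cor(carat, price)).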