library(tidyverse)
library(datasauRus)
The datasaurus_dozen data-set has 1846 rows and 3
columns. It contains the variables x, y, and
dataset. The dataset variable identifies which
of the 13 data-sets each point belongs to, and each data-set contains
142 observations.
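The counts above can be checked directly. The following is a minimal sketch, assuming the dplyr and datasauRus packages are installed; `dims` and `obs_per_dataset` are illustrative names.

```r
library(dplyr)
library(datasauRus)

# Dimensions of the full data frame: rows first, then columns
dims <- dim(datasaurus_dozen)

# Number of observations in each data-set (one row per data-set)
obs_per_dataset <- datasaurus_dozen %>%
  count(dataset)

dims
obs_per_dataset
```

With 13 data-sets of 142 observations each, the row count should equal 13 × 142.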
First, let’s plot the data in the dino data-set:
dino_data <- datasaurus_dozen %>%
  filter(dataset == "dino")
ggplot(data = dino_data, mapping = aes(x = x, y = y)) +
  geom_point()
Next, calculate the correlation between x and
y in this data-set:
dino_data %>%
  summarize(r = cor(x, y))
## # A tibble: 1 × 1
##         r
##     <dbl>
## 1 -0.0645
Next, we explore the star data-set. We filter it from
the datasaurus_dozen data-set, then plot x vs
y and calculate the correlation coefficient.
star_data <- datasaurus_dozen %>%
  filter(dataset == "star")
ggplot(data = star_data, aes(x = x, y = y)) +
  geom_point()
The correlation for the star data-set is also close to 0, meaning there is almost no linear relationship between x and y in this data-set.
star_data %>%
  summarize(r = cor(x, y))
## # A tibble: 1 × 1
##         r
##     <dbl>
## 1 -0.0630
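A coefficient near 0 says only that there is little linear relationship; x and y can still be strongly related in a nonlinear way. The sketch below uses made-up data (a parabola, not one of the datasaurus_dozen data-sets) to illustrate this, and `r_parabola` is an illustrative name.

```r
# Illustrative data only (not part of datasaurus_dozen):
# y is completely determined by x, but the relationship is a curve.
x <- seq(-1, 1, length.out = 101)
y <- x^2

# Pearson correlation measures linear association only
r_parabola <- cor(x, y)
round(r_parabola, 4)
```

Because the x values are symmetric around 0, the rising and falling halves of the curve cancel out and the coefficient is essentially 0, even though y depends entirely on x.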
Now we explore the circle data-set. We filter it from
the datasaurus_dozen data-set, then plot x vs
y and calculate the correlation coefficient.
# Step 1: filter circle data-set
circle_data <- datasaurus_dozen %>%
  filter(dataset == "circle")
# Step 2: plot x vs y
ggplot(data = circle_data, aes(x = x, y = y)) +
  geom_point()
# Step 3: Calculate correlation and clean output
r_circle <- circle_data %>%
  summarize(r = cor(x, y)) %>%
  pull(r)
paste("The correlation between x and y in the circle dataset is", round(r_circle, 4))
## [1] "The correlation between x and y in the circle dataset is -0.0683"
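The filter-then-correlate steps used for dino, star, and circle can be wrapped in a small function. The following is a minimal sketch, assuming dplyr and datasauRus are installed; `dataset_cor` is a hypothetical helper name, not something datasauRus provides.

```r
library(dplyr)
library(datasauRus)

# Hypothetical helper (not part of datasauRus): returns the correlation
# between x and y for one named data-set.
dataset_cor <- function(name) {
  datasaurus_dozen %>%
    filter(dataset == name) %>%
    summarize(r = cor(x, y)) %>%
    pull(r)
}

# Apply the helper to the three data-sets explored so far
r_values <- sapply(c("dino", "star", "circle"), dataset_cor)
round(r_values, 4)
```

The result is a named numeric vector, so each coefficient stays labelled with its data-set.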
ggplot(datasaurus_dozen, aes(x = x, y = y, color = dataset)) +
  geom_point() +
  facet_wrap(~ dataset, ncol = 3) +
  theme(legend.position = "none")
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y)) %>%
  print(n = 13)
## # A tibble: 13 × 2
##    dataset          r
##    <chr>        <dbl>
##  1 away       -0.0641
##  2 bullseye   -0.0686
##  3 circle     -0.0683
##  4 dino       -0.0645
##  5 dots       -0.0603
##  6 h_lines    -0.0617
##  7 high_lines -0.0685
##  8 slant_down -0.0690
##  9 slant_up   -0.0686
## 10 star       -0.0630
## 11 v_lines    -0.0694
## 12 wide_lines -0.0666
## 13 x_shape    -0.0656
# Visually, "slant_up" and "slant_down" appear the most correlated because their points fall along parallel slanted lines, but their coefficients are as close to 0 as those of the other data-sets.
datasaurus_dozen %>%
  filter(dataset %in% c("slant_up", "slant_down")) %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y))
## # A tibble: 2 × 2
##   dataset          r
##   <chr>        <dbl>
## 1 slant_down -0.0690
## 2 slant_up   -0.0686
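To see where slant_up and slant_down fall relative to the other data-sets, one option is to sort the per-data-set coefficients and look at their range. This is a sketch, assuming dplyr and datasauRus are installed; `r_by_dataset` is an illustrative name.

```r
library(dplyr)
library(datasauRus)

# One correlation per data-set, sorted from most negative upward
r_by_dataset <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y)) %>%
  arrange(r)

# Smallest and largest coefficient across the 13 data-sets
range(r_by_dataset$r)
```

A narrow range here would indicate that the visual impression of a strong pattern in a scatterplot is not reflected in the correlation coefficient.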