Load packages and data

library(tidyverse) 
library(datasauRus)

Exercises

Exercise 1

The datasaurus_dozen data-set has 1846 rows and 3 columns. It contains the variables x, y, and dataset. The dataset variable identifies which of the 13 data-sets each point belongs to, and each data-set contains 142 observations (13 × 142 = 1846).
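The counts above can be checked directly. This is a minimal sketch that assumes the tidyverse and datasauRus packages are installed; the object name obs_per_dataset is only illustrative.

```r
library(tidyverse)
library(datasauRus)

# Overall dimensions: rows first, then columns
dim(datasaurus_dozen)

# Number of observations in each data-set (one row per data-set)
obs_per_dataset <- datasaurus_dozen %>%
  count(dataset)

obs_per_dataset %>%
  print(n = Inf)
```

Here count() tallies the rows for each value of dataset, so the result should have one row per data-set with the same n in each.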

Exercise 2

First, let’s plot the data in the dino data-set:

dino_data <- datasaurus_dozen %>%
  filter(dataset == "dino")

ggplot(data = dino_data, mapping = aes(x = x, y = y)) +
  geom_point()

Next, we calculate the correlation between x and y in this data-set:

dino_data %>%
  summarize(r = cor(x, y))
## # A tibble: 1 × 1
##         r
##     <dbl>
## 1 -0.0645

Exercise 3

Next, we explore the star data-set. We filter it from the datasaurus_dozen data-set, then plot x vs y and calculate the correlation coefficient.

star_data <- datasaurus_dozen %>%
  filter(dataset == "star")

ggplot(data = star_data, aes(x = x, y = y)) +
  geom_point()

The correlation for the star data-set is -0.0630, nearly the same as the dino value of -0.0645. Both are close to 0, meaning there is almost no linear relationship between x and y, even though the points clearly form a star.

star_data %>%
  summarize(r = cor(x, y))
## # A tibble: 1 × 1
##         r
##     <dbl>
## 1 -0.0630
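A correlation near 0 only rules out a linear relationship, not a pattern. The sketch below illustrates this with made-up data (not taken from datasaurus_dozen): y is completely determined by x, yet the correlation coefficient comes out essentially zero because the relationship is a symmetric curve rather than a line.

```r
# Hypothetical example data, chosen to be symmetric around 0
x <- seq(-5, 5, by = 0.5)
y <- x^2  # y depends entirely on x, but not linearly

# Pearson correlation only measures linear association
r_parabola <- cor(x, y)
round(r_parabola, 4)
```

The same reasoning applies to the dino and star shapes: a clear visual structure can coexist with a correlation close to 0.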

Exercise 4

Now we explore the circle data-set. We filter it from the datasaurus_dozen data-set, then plot x vs y and calculate the correlation coefficient.

# Step 1: filter circle data-set
circle_data <- datasaurus_dozen %>%
  filter(dataset == "circle")

# Step 2: plot x vs y
ggplot(data = circle_data, aes(x = x, y = y)) +
  geom_point()

# Step 3: Calculate correlation and clean output
r_circle <- circle_data %>%
  summarize(r = cor(x, y)) %>%
  pull(r)

paste("The correlation between x and y in the circle dataset is", round(r_circle, 4))
## [1] "The correlation between x and y in the circle dataset is -0.0683"

Exercise 5

ggplot(datasaurus_dozen, aes(x = x, y = y, color = dataset)) +
  geom_point() +
  facet_wrap(~ dataset, ncol = 3) +
  theme(legend.position = "none")

datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y)) %>%
  print(n = 13)
## # A tibble: 13 × 2
##    dataset          r
##    <chr>        <dbl>
##  1 away       -0.0641
##  2 bullseye   -0.0686
##  3 circle     -0.0683
##  4 dino       -0.0645
##  5 dots       -0.0603
##  6 h_lines    -0.0617
##  7 high_lines -0.0685
##  8 slant_down -0.0690
##  9 slant_up   -0.0686
## 10 star       -0.0630
## 11 v_lines    -0.0694
## 12 wide_lines -0.0666
## 13 x_shape    -0.0656
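The grouped summary can be extended beyond the correlation. The following sketch assumes the same packages as above and uses the illustrative name summary_stats; it computes the mean and standard deviation of x and y alongside r, which makes it easy to see how similar the 13 data-sets are numerically despite looking so different in the plots.

```r
library(tidyverse)
library(datasauRus)

# One row per data-set with basic summary statistics
summary_stats <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    r      = cor(x, y)
  )

# n = 13 shows all rows; the default would print only the first 10
summary_stats %>%
  print(n = 13)
```

Note that the row count must be passed as n =, because the first unnamed argument after the tibble in print() is the output width, not the number of rows.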

Exercise 6

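In the plots, slant_up and slant_down appear the most correlated because their points fall along slanted lines. The computed correlations tell a different story: both are about -0.07, close to 0 and in line with the other data-sets, so neither has a strong linear relationship overall.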

datasaurus_dozen %>%
  filter(dataset %in% c("slant_up", "slant_down")) %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y))
## # A tibble: 2 × 2
##   dataset          r
##   <chr>        <dbl>
## 1 slant_down -0.0690
## 2 slant_up   -0.0686
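Rather than judging correlation by eye, the data-sets can be ranked by the size of their correlation coefficient. This is a sketch under the same package assumptions as above; ranked and abs_r are illustrative names.

```r
library(tidyverse)
library(datasauRus)

# Rank all data-sets by the absolute value of their correlation
ranked <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y)) %>%
  mutate(abs_r = abs(r)) %>%
  arrange(desc(abs_r))

ranked %>%
  print(n = 13)
```

Sorting on the absolute value treats strong negative and strong positive correlations alike, which is what "most correlated" usually means.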