Welcome to the PSYC3361 coding W2 self test. The test assesses your ability to use the coding skills covered in the Week 2 online coding modules.

In particular, it assesses your ability to…

It is IMPORTANT to document the code that you write so that someone who is looking at your code can understand what it is doing. Above each chunk, write a few sentences outlining which packages/functions you have chosen to use and what the function is doing to your data. Where relevant, also write a sentence that interprets the output of your code.

Your notes should also document the troubleshooting process you went through to arrive at the code that worked.

For each of the challenges below, the documentation is JUST AS IMPORTANT as the code.

Good luck!! Jenny

PS- if you get stuck have a look in the /images folder for inspiration

load the packages you need

library(tidyverse)

The tidyverse package is a collection of R packages designed for data science. It includes ggplot2 for creating plots, readr for reading in data files, and dplyr for group_by() and summarise(). Loading tidyverse gives us access to these functions throughout this document.

read in the dino data

dino <- read_csv("data/dino.csv")

The read_csv() function from the readr package (part of tidyverse) reads a CSV file and stores it as a data frame (a tibble). Here we read in dino.csv from the data folder and save it as an object called dino. The file path "data/dino.csv" is relative, so R looks inside the data subfolder of the current working directory, which is normally the project folder.
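As a small illustration of how read_csv() behaves, here is a sketch that does not touch dino.csv. It writes a tiny made-up CSV to a temporary file and reads it back. The toy data and the object names tmp and toy are assumptions for the example only. The show_col_types = FALSE argument is the option suggested in readr's own message for quieting the column specification output.

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file (hypothetical data, not dino.csv)
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(dataset = c("a", "b"), x = c(1.5, 2.5), y = c(3, 4)), tmp)

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(tmp, show_col_types = FALSE)

# Column types are guessed from the file contents: dataset is chr, x and y are dbl
str(toy)
```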

reproduce this plot

The dino dataset comes from a paper illustrating the importance of plotting your data. In each of these datasets, the mean and variance of x and y are identical and the two variables are correlated in the same way (R = -0.06). When plotted, however, each reveals a very different pattern.

dino <- read_csv("data/dino.csv")
## Rows: 1846 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): dataset
## dbl (2): x, y
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(dino, aes(x = x, y = y)) +
  geom_point(size = 0.8) + 
  geom_smooth(method = "lm", se = FALSE, colour = "blue") + 
  facet_wrap(~ dataset) + 
  theme_bw() + 
  labs(x = "x", y = "y")
## `geom_smooth()` using formula = 'y ~ x'


To reproduce this plot, ggplot() is used to initialise the plot with dino as the data and x and y mapped to the axes using aes(). geom_point() adds a scatter plot layer with points sized at 0.8. geom_smooth(method = "lm", se = FALSE) adds a straight linear regression line to each panel with the confidence interval band turned off. facet_wrap(~ dataset) splits the plot into separate panels, one for each dataset (i.e. each shape). theme_bw() applies a clean black and white theme. labs() sets the axis labels.
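The layering described above can also be built one step at a time, which can make it easier to see what each function contributes. This is only a sketch on a small made-up data frame (toy is hypothetical and simply reuses the dino column names), not the dino data itself.

```r
library(ggplot2)

# Hypothetical stand-in data with the same column names as dino
toy <- data.frame(
  dataset = rep(c("a", "b"), each = 5),
  x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  y = c(2, 4, 5, 4, 6, 6, 5, 3, 2, 1)
)

p <- ggplot(toy, aes(x = x, y = y))   # empty plot with x and y mapped to the axes
p <- p + geom_point(size = 0.8)       # layer 1: the points
p <- p + geom_smooth(method = "lm", se = FALSE, colour = "blue")  # layer 2: fitted line, no band
p <- p + facet_wrap(~ dataset)        # one panel per dataset
p <- p + theme_bw()                   # black and white theme

# Printing the object draws the plot
p
```

Saving the plot in an object and printing it after each step is one way to troubleshoot which line changes what.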

what can you do to make it prettier

HINT: add some colour, play with palettes, try a different theme, add a title, subtitle, caption

dino <- read_csv("data/dino.csv")
## Rows: 1846 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): dataset
## dbl (2): x, y
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(dino, aes(x = x, y = y, colour = dataset)) +
  geom_point(size = 0.8) + 
  geom_smooth(method = "lm", se = FALSE, colour = "pink") + 
  facet_wrap(~ dataset) + 
  theme_dark() + 
  labs(x = "x", y = "y",
    title = "This is my title",
    subtitle = "This is my subtitle",
    caption = "This is my caption") +
  theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'

To make the plot more visually appealing, colour = dataset is added inside aes() so that each panel's points are coloured by dataset name. theme_dark() replaces the default theme with a dark background. A title, subtitle, and caption are added inside labs(). theme(legend.position = "none") removes the legend since the panel labels already identify each dataset.

The labs() call must be connected to the rest of the plot with +. Without the +, R evaluates labs() as a separate expression, so the labels are printed as plain text output rather than appearing on the plot. Once labs() is added as a proper layer using +, the labels render correctly.
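The difference the + makes can be sketched on a small made-up data frame. The names toy, base_plot, labelled and orphan_labels are hypothetical and used only for this illustration.

```r
library(ggplot2)

# Hypothetical data for illustration only
toy <- data.frame(x = c(1, 2, 3), y = c(2, 1, 3))

base_plot <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Joined with +: the title becomes part of the plot object
labelled <- base_plot + labs(title = "This is my title")

# Not joined: labs() is evaluated on its own and base_plot is left unchanged
orphan_labels <- labs(title = "This is my title")

# Only the version built with + carries the title
labelled$labels$title
```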

extra challenge

Can you write code to show that the mean, variance, and correlation between x and y is the same for each of the datasets?? HINT: this is a group_by and summarise problem

dino %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            var_x = var(x),
            var_y = var(y),
            correlation = cor(x, y))
## # A tibble: 13 × 6
##    dataset    mean_x mean_y var_x var_y correlation
##    <chr>       <dbl>  <dbl> <dbl> <dbl>       <dbl>
##  1 away         54.3   47.8  281.  726.     -0.0641
##  2 bullseye     54.3   47.8  281.  726.     -0.0686
##  3 circle       54.3   47.8  281.  725.     -0.0683
##  4 dino         54.3   47.8  281.  726.     -0.0645
##  5 dots         54.3   47.8  281.  725.     -0.0603
##  6 h_lines      54.3   47.8  281.  726.     -0.0617
##  7 high_lines   54.3   47.8  281.  726.     -0.0685
##  8 slant_down   54.3   47.8  281.  726.     -0.0690
##  9 slant_up     54.3   47.8  281.  726.     -0.0686
## 10 star         54.3   47.8  281.  725.     -0.0630
## 11 v_lines      54.3   47.8  281.  726.     -0.0694
## 12 wide_lines   54.3   47.8  281.  726.     -0.0666
## 13 x_shape      54.3   47.8  281.  725.     -0.0656

group_by(dataset) splits the data into groups, one per dataset, so that all subsequent calculations are performed separately for each shape. summarise() then calculates summary statistics for each group: mean() gives the average of x and y, var() gives the variance, and cor() gives the correlation between x and y. The output table shows that, despite looking completely different when plotted, all 13 datasets have nearly identical means and variances, and a correlation between about -0.06 and -0.07.
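One possible way to put a number on "nearly identical" is to take the per-group summary and then look at how far each statistic ranges across the groups. The sketch below uses a small made-up tibble (toy, per_group and spread are hypothetical names), so the values are not the dino results. The same two steps could be applied to dino.

```r
library(dplyr)

# Hypothetical toy data: two groups built to share the same means of x and y
toy <- tibble(
  dataset = rep(c("a", "b"), each = 4),
  x = c(1, 2, 3, 4, 0, 2, 3, 5),
  y = c(2, 1, 4, 3, 5, 0, 3, 2)
)

# Step 1: one row of summary statistics per dataset
per_group <- toy %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            correlation = cor(x, y))

# Step 2: range of each statistic across datasets; a value near 0 means the groups agree
spread <- per_group %>%
  summarise(across(c(mean_x, mean_y, correlation), ~ max(.x) - min(.x)))

spread
```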

The pipe operator %>% must be included at the end of every line in the chain except the last, so that each result is passed forward to the next function. Missing a %>% ends the chain early: R then runs the next function on its own without the data, which causes an error such as "object not found".
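To see what the pipe is doing, the chain can be compared with the same calls written without it. This is a sketch on made-up data (toy, piped and nested are hypothetical names): each %>% places the result on its left into the first argument of the function that follows.

```r
library(dplyr)

# Hypothetical data for illustration only
toy <- tibble(dataset = c("a", "a", "b", "b"), x = c(1, 3, 2, 6))

# Piped version: each %>% hands the result so far to the function on the next line
piped <- toy %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x))

# Equivalent nested version: the same calls written inside-out, with no pipe
nested <- summarise(group_by(toy, dataset), mean_x = mean(x))

# Both approaches give the same table
all.equal(piped, nested)
```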

knit your document to pdf