Welcome to the PSYC3361 coding W2 self test. The test assesses your ability to use the coding skills covered in the Week 2 online coding modules.
In particular, it assesses your ability to…
It is IMPORTANT to document the code that you write so that someone who is looking at your code can understand what it is doing. Above each chunk, write a few sentences outlining which packages/functions you have chosen to use and what the function is doing to your data. Where relevant, also write a sentence that interprets the output of your code.
Your notes should also document the troubleshooting process you went through to arrive at the code that worked.
For each of the challenges below, the documentation is JUST AS IMPORTANT as the code.
Good luck!!
Jenny
I am going to use the tidyverse package, which contains both the ggplot2 and dplyr packages, along with here (which is useful for telling R where the data is).
library(tidyverse)
library(here)
The data is in .csv format so I am going to use the read_csv() function. This call tells R to find the data “here” within the data folder and to make a new object called dino.
dino <- read_csv(here("data", "dino.csv"))
The dino data comes from a paper illustrating the importance of plotting your data. It contains 13 datasets. In each of them, the mean and variance of x and y are almost identical and the two variables are correlated to almost the same degree (r between about -0.06 and -0.07). When plotted, however, each reveals a very different pattern.
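The same idea can be tried without any data file. The sketch below is only an illustration and uses a different dataset from dino.csv: Anscombe's quartet, which ships with base R as `anscombe` (columns x1 to x4 and y1 to y4). The object name `stats` is my own choice.

```r
# Anscombe's quartet is built into R (datasets package); it is NOT the dino data,
# just a smaller example of "same summary statistics, different shapes".
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),
    cor_xy = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)

# One column per dataset, one row per statistic
round(stats, 2)
```

Each of the four sets has a mean x of 9, a variance of x of 11, and a correlation of about 0.82, yet plotting them shows four quite different patterns.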
Looking at the dataframe, I can see that the dataset variable can be used to separate these plots. I am going to plot x on the x axis and y on the y axis, and use geom_point() to get the dots. I can make ggplot draw a separate plot for each dataset by using facet_wrap(). I can add geom_smooth() to get the regression line, but the default wasn’t looking good. I worked out that I needed to specify method = "lm" to get a straight line that resembled the one in the original. I also needed to extend the limits of the y axis using scale_y_continuous().
dino %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ dataset) +
  scale_y_continuous(limits = c(0, 100))
## `geom_smooth()` using formula 'y ~ x'
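One thing worth knowing about setting limits with scale_y_continuous(): any values outside the limits are turned into NA before geom_smooth() fits its line, so they are left out of the fit. coord_cartesian() only zooms the view and keeps every point in the calculation. The sketch below shows the two side by side on a made-up data frame (`toy` and its values are hypothetical, not the dino data); it matters only if some y values fall outside the limits you choose.

```r
library(ggplot2)

# Hypothetical toy data: the last point (y = 150) lies outside the 0-100 range
toy <- data.frame(x = 1:6, y = c(10, 20, 30, 40, 50, 150))

# Scale limits: the out-of-range point becomes NA, so the line is fitted without it
p_scale <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_y_continuous(limits = c(0, 100))

# Coord limits: all six points are used in the fit; only the visible window is cropped
p_coord <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  coord_cartesian(ylim = c(0, 100))
```

Printing `p_scale` and `p_coord` one after the other shows two different regression lines, because the first is fitted to five points and the second to six.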
HINT: add some colour, play with palettes, try a different theme, add a title, subtitle, caption
Here I am playing with colour by colouring the points by their x value. Originally that just made them boring blue, but I found the scale_color_gradientn() function, which makes them rainbow. I got rid of the nasty grey background by using theme_minimal() and added a useful title and caption.
dino %>%
  ggplot(aes(x = x, y = y, colour = x)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_gradientn(colours = rainbow(5)) +
  facet_wrap(~ dataset) +
  scale_y_continuous(limits = c(0, 100)) +
  theme_minimal() +
  labs(title = "The M, variance, and correlation between x and y \n is the same for each of these datasets",
       caption = "TAKE HOME MESSAGE: plot your data")
## `geom_smooth()` using formula 'y ~ x'
Can you write code to show that the mean, variance, and correlation between x and y are the same for each of the datasets? HINT: this is a group_by and summarise problem
dino %>%
  group_by(dataset) %>%
  summarise(meanx = mean(x), meany = mean(y),
            varx = var(x), vary = var(y), cor_xy = cor(x, y))
## # A tibble: 13 × 6
##    dataset    meanx meany  varx  vary  cor_xy
##    <chr>      <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1 away        54.3  47.8  281.  726. -0.0641
##  2 bullseye    54.3  47.8  281.  726. -0.0686
##  3 circle      54.3  47.8  281.  725. -0.0683
##  4 dino        54.3  47.8  281.  726. -0.0645
##  5 dots        54.3  47.8  281.  725. -0.0603
##  6 h_lines     54.3  47.8  281.  726. -0.0617
##  7 high_lines  54.3  47.8  281.  726. -0.0685
##  8 slant_down  54.3  47.8  281.  726. -0.0690
##  9 slant_up    54.3  47.8  281.  726. -0.0686
## 10 star        54.3  47.8  281.  725. -0.0630
## 11 v_lines     54.3  47.8  281.  726. -0.0694
## 12 wide_lines  54.3  47.8  281.  726. -0.0666
## 13 x_shape     54.3  47.8  281.  725. -0.0656
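To interpret this output, it helps to ask how far apart the 13 datasets actually are on each statistic. The sketch below is one way to check, assuming we only have the printed values: it retypes the numbers from the tibble above (which are already rounded for display, so this is not the full-precision result) and takes max minus min for each column. The names `printed` and `spread` are my own.

```r
# Values copied by hand from the printed tibble (rounded for display)
printed <- data.frame(
  dataset = c("away", "bullseye", "circle", "dino", "dots", "h_lines",
              "high_lines", "slant_down", "slant_up", "star", "v_lines",
              "wide_lines", "x_shape"),
  meanx  = rep(54.3, 13),
  meany  = rep(47.8, 13),
  varx   = rep(281, 13),
  vary   = c(726, 726, 725, 726, 725, 726, 726, 726, 726, 725, 726, 726, 725),
  cor_xy = c(-0.0641, -0.0686, -0.0683, -0.0645, -0.0603, -0.0617, -0.0685,
             -0.0690, -0.0686, -0.0630, -0.0694, -0.0666, -0.0656)
)

# Spread (max minus min) of each statistic across the 13 datasets
spread <- sapply(printed[-1], function(col) max(col) - min(col))
spread
```

At the printed precision the means and the variance of x do not differ at all, the variance of y differs by 1, and the correlations differ by about 0.009. So the summary statistics are the same for practical purposes, even though the plots look nothing alike.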