April 2020 PsiChiR challenge

The new challenge is up!

Contest submissions: submit here.

For this contest, you will be using this dataset: https://osf.io/jfdtk/

The data set is from this publication: https://cdn.ymaws.com/www.psichi.org/resource/resmgr/journal_2019/24_2_crambletalvarez.pdf

A codebook can be found here: https://osf.io/bkd5m/

You can complete and submit R script for any of the four levels, but higher levels require the completion of lower levels (in other words, if you want to turn in Level 3 script, you’ll need to also do levels 1 and 2).

For tips on getting started with R, start here: http://bit.ly/psichirtutorial

Set up

Load packages

here are the packages you will need. Use install.package() if you don’t already have these ones installed.

library(tidyverse)
library(here)
library(janitor)
library(naniar)
library(broom)
library(kableExtra)

Read in the data

The dataset is the same as from the March challenge. So practice reading it in again using read_csv, here, and clean names.

Then go to the code book and choose your 4 favourite psychologists. Mine favourites are developmental psychologists

Use select to create df with age and the recognition values for your favourite 4.

april <- read_csv(here("data", "April.csv")) %>%
  clean_names()
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   StartDate = col_character(),
##   EndDate = col_character(),
##   Specify_ethnicity = col_character(),
##   University = col_character(),
##   Class_specify = col_logical(),
##   Major = col_character(),
##   Nameofwomenclass = col_character()
## )
## See spec(...) for full column specifications.
april_fav4 <- april %>%
  select(age, f2, f9, m25, m32)

Level 1: Remove participants with missing responses for the variable “Age.”

The naniar package helps you visualise and deal with missing data. Use install.packages() if you don’t already have it.

Use vis_miss() to see how much of a problem missing avalues in the Age variable are

vis_miss(april_fav4)

To remove participants who have missing age values, use filter and is not equal to (!=) “NA”, to filter out the NA values

april_age4 <- april_fav4 %>%
  filter(age != "NA")

Level 2: Provide frequencies of three different age groups: How many people are Less than or equal to 20? How many people between 21-40? How many people are Greater than 40?

Use mutate and if_else to make a new variable that creates age groups

https://dplyr.tidyverse.org/reference/if_else.html

april_group4 <- april_age4 %>%
  mutate(agegroup = if_else(age > 40, "old",
                    if_else(age <= 20, "young", "middle"))) %>%
  select(age, agegroup, 2:6) 

Use group_by, summarise and n() to count how many participants fall into each of the 3 age groups

april_group4 %>%
  group_by(agegroup) %>%
  summarise(count = n())
## # A tibble: 3 x 2
##   agegroup count
##   <chr>    <int>
## 1 middle     173
## 2 old         13
## 3 young       49

Level 3: Create 4 graphs to visualize recognition of any 4 psychologists by age of participant.

Use pivot_longer to make a single column with the psychologist in it.

april_long <- april_group4 %>%
  pivot_longer(names_to = "psych", values_to = "score", f2:m32)

Use recode to replace values in that column with names

april_recode <- april_long %>%
  mutate(psych_name = recode(psych, "f2" = "mary",
                          "f9" = "eleanor",
                          "m25" = "jean",
                       "m32" = "albert", .default = "NA"))

Use group_by and summarise to calculate mean scores by agegroup and psych_name and then

april_means <- april_recode %>%
  group_by(agegroup, psych_name) %>%
  summarise(mean = mean(score))

Plot using facet_wrap.

april_means %>%
  ggplot(aes(x = agegroup, y = mean)) +
  geom_col() +
  facet_wrap(~ psych_name) 

Are your age groups out of order? Use fct_relevel to make your columns be ordered young, middle, old, then plot again.

april_means$agegroup <- fct_relevel(april_means$agegroup, c("young","middle", "old"))

april_means %>%
  ggplot(aes(x = agegroup, y = mean)) +
  geom_col() +
  facet_wrap(~ psych_name) 

Level 4: Run a simple linear regression analysis to determine if you can predict recognition of male psychologists from participant age.

Go back to the df that you read in and create a new one that includes only age and male_score, get rid of NAs in age

april_age_male <- april %>%
  select(age, male_score) %>%
  filter(age != "NA")

Plot the relation between age and male_score as scatter plot (add geom_point and geom_smooth to get an idea of whether there is a linear relation)

april_age_male %>%
  ggplot(aes(x = age, y = male_score)) +
          geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Use lm funciton to predict male score from age

lm_model <- lm(male_score ~ age, data=april_age_male)

Use print() to get coefficients

print(lm_model)
## 
## Call:
## lm(formula = male_score ~ age, data = april_age_male)
## 
## Coefficients:
## (Intercept)          age  
##    1.640514    -0.001023

Use summary() to get all model information

summary(lm_model)
## 
## Call:
## lm(formula = male_score ~ age, data = april_age_male)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23705 -0.36303  0.04764  0.35331  1.04866 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.640514   0.105157  15.601   <2e-16 ***
## age         -0.001023   0.004161  -0.246    0.806    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4747 on 233 degrees of freedom
## Multiple R-squared:  0.0002594,  Adjusted R-squared:  -0.004031 
## F-statistic: 0.06045 on 1 and 233 DF,  p-value: 0.806

Use the broom package to tidy this ugly output into a df using the tidy() function and then kable() to make a nice table.

tidymodel <- tidy(lm_model)

kable(tidymodel)
term estimate std.error statistic p.value
(Intercept) 1.6405135 0.1051570 15.6006105 0.000000
age -0.0010232 0.0041614 -0.2458666 0.806002