The new challenge is up!
Contest submissions: submit here.
For this contest, you will be using this dataset: https://osf.io/jfdtk/
The data set is from this publication: https://cdn.ymaws.com/www.psichi.org/resource/resmgr/journal_2019/24_2_crambletalvarez.pdf
A codebook can be found here: https://osf.io/bkd5m/
You can complete and submit R script for any of the four levels, but higher levels require the completion of lower levels (in other words, if you want to turn in Level 3 script, you’ll need to also do levels 1 and 2).
For tips on getting started with R, start here: http://bit.ly/psichirtutorial
here are the packages you will need. Use install.package() if you don’t already have these ones installed.
library(tidyverse)
library(here)
library(janitor)
library(naniar)
library(broom)
library(kableExtra)
The dataset is the same as from the March challenge. So practice reading it in again using read_csv, here, and clean names.
Then go to the code book and choose your 4 favourite psychologists. Mine favourites are developmental psychologists
Use select to create df with age and the recognition values for your favourite 4.
april <- read_csv(here("data", "April.csv")) %>%
clean_names()
## Parsed with column specification:
## cols(
## .default = col_double(),
## StartDate = col_character(),
## EndDate = col_character(),
## Specify_ethnicity = col_character(),
## University = col_character(),
## Class_specify = col_logical(),
## Major = col_character(),
## Nameofwomenclass = col_character()
## )
## See spec(...) for full column specifications.
april_fav4 <- april %>%
select(age, f2, f9, m25, m32)
The naniar package helps you visualise and deal with missing data. Use install.packages() if you don’t already have it.
Use vis_miss() to see how much of a problem missing avalues in the Age variable are
vis_miss(april_fav4)
To remove participants who have missing age values, use filter and is not equal to (!=) “NA”, to filter out the NA values
april_age4 <- april_fav4 %>%
filter(age != "NA")
Use mutate and if_else to make a new variable that creates age groups
https://dplyr.tidyverse.org/reference/if_else.html
april_group4 <- april_age4 %>%
mutate(agegroup = if_else(age > 40, "old",
if_else(age <= 20, "young", "middle"))) %>%
select(age, agegroup, 2:6)
Use group_by, summarise and n() to count how many participants fall into each of the 3 age groups
april_group4 %>%
group_by(agegroup) %>%
summarise(count = n())
## # A tibble: 3 x 2
## agegroup count
## <chr> <int>
## 1 middle 173
## 2 old 13
## 3 young 49
Use pivot_longer to make a single column with the psychologist in it.
april_long <- april_group4 %>%
pivot_longer(names_to = "psych", values_to = "score", f2:m32)
Use recode to replace values in that column with names
april_recode <- april_long %>%
mutate(psych_name = recode(psych, "f2" = "mary",
"f9" = "eleanor",
"m25" = "jean",
"m32" = "albert", .default = "NA"))
Use group_by and summarise to calculate mean scores by agegroup and psych_name and then
april_means <- april_recode %>%
group_by(agegroup, psych_name) %>%
summarise(mean = mean(score))
Plot using facet_wrap.
april_means %>%
ggplot(aes(x = agegroup, y = mean)) +
geom_col() +
facet_wrap(~ psych_name)
Are your age groups out of order? Use fct_relevel to make your columns be ordered young, middle, old, then plot again.
april_means$agegroup <- fct_relevel(april_means$agegroup, c("young","middle", "old"))
april_means %>%
ggplot(aes(x = agegroup, y = mean)) +
geom_col() +
facet_wrap(~ psych_name)
Go back to the df that you read in and create a new one that includes only age and male_score, get rid of NAs in age
april_age_male <- april %>%
select(age, male_score) %>%
filter(age != "NA")
Plot the relation between age and male_score as scatter plot (add geom_point and geom_smooth to get an idea of whether there is a linear relation)
april_age_male %>%
ggplot(aes(x = age, y = male_score)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Use lm funciton to predict male score from age
lm_model <- lm(male_score ~ age, data=april_age_male)
Use print() to get coefficients
print(lm_model)
##
## Call:
## lm(formula = male_score ~ age, data = april_age_male)
##
## Coefficients:
## (Intercept) age
## 1.640514 -0.001023
Use summary() to get all model information
summary(lm_model)
##
## Call:
## lm(formula = male_score ~ age, data = april_age_male)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.23705 -0.36303 0.04764 0.35331 1.04866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.640514 0.105157 15.601 <2e-16 ***
## age -0.001023 0.004161 -0.246 0.806
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4747 on 233 degrees of freedom
## Multiple R-squared: 0.0002594, Adjusted R-squared: -0.004031
## F-statistic: 0.06045 on 1 and 233 DF, p-value: 0.806
Use the broom package to tidy this ugly output into a df using the tidy() function and then kable() to make a nice table.
tidymodel <- tidy(lm_model)
kable(tidymodel)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.6405135 | 0.1051570 | 15.6006105 | 0.000000 |
age | -0.0010232 | 0.0041614 | -0.2458666 | 0.806002 |