Welcome to Introduction to R! This is where I am storing all my notes
about R from the PSYC3361 online modules. This document is intended to
be a master “cheat sheet” of sorts, so that all the commands are in one,
easily accessible place. Writing my notes in R Markdown is also a handy
way for me to actively practise what I’ve learned. Happy reading and
happy coding!
Hyperlinks
Images
Here is an example of a block quote!
NB: everything covered above in the Markdown formatting
guide also applies to R Markdown!
print("hello world!")
## [1] "hello world!"
Welcome to the actual coding portion of R! Please note that R
Markdown can only read a data file if it can find that file when the
document is knitted, and the file paths below point to my own Downloads
folder, so these notes might be a little disjointed or incomplete if
you try to run them yourself. I’d recommend using these notes to
interpret an actual R script, so that you can see the code in action
while also understanding the function of each line of code.
When you code in R, there are three main parts that you will be
working with (in the following order):
You’ll see all this in action below!
NOTE THE FOLLOWING FOUNDATIONAL TERMINOLOGY
Given the line:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
A note about named arguments
ggplot(mpg, aes(displ,hwy))
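To make the note about named arguments concrete, here is a minimal sketch using the mpg data that ships with ggplot2. The variable names p_named and p_positional are my own, chosen only for illustration. When arguments are named, their order does not matter; when they are left unnamed, R matches them by position, so the values have to appear in the order the function expects.

```r
library(ggplot2)

# Named arguments: each value is labelled, so the order does not matter
p_named <- ggplot(mapping = aes(y = hwy, x = displ), data = mpg)

# Unnamed (positional) arguments: values are matched by position,
# so data comes first and mapping second; inside aes(), x then y
p_positional <- ggplot(mpg, aes(displ, hwy))

# Both plots map displ to x and hwy to y
print(p_named$mapping)
print(p_positional$mapping)
```

Naming arguments takes more typing, but it makes the code easier to read back later, which is why most of the code in these notes spells them out.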
What is Tidyverse?
#load packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#read the data
frames <- read_csv(file = "C:\\Users\\Alyss\\Downloads\\data_reasoning.csv")
## Rows: 4725 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gender, condition, sample_size
## dbl (5): id, age, n_obs, test_item, response
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
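In the above script, we have:
loaded the tidyverse package using library(tidyverse)
read the data_reasoning.csv file using the read_csv(file = …) function
stored the data in a variable named “frames” by using “frames <-”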
Interpreting the console output
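The console output from read_csv() reports how many rows and columns were read (4725 and 8 here), the delimiter, and the type readr guessed for each column (chr for text, dbl for numbers). The last line of the message suggests two ways to quiet it. The sketch below tries both on a tiny made-up CSV; the data, and the names toy_csv, toy and toy_quiet, are my own assumptions and are not part of the course files.

```r
library(readr)

# A tiny stand-in for a CSV file; I() tells readr this is literal data, not a path
toy_csv <- I("id,gender,age\n1,male,36\n2,female,41\n")

# Option 1: declare the column types yourself
toy <- read_csv(
  toy_csv,
  col_types = cols(
    id = col_double(),
    gender = col_character(),
    age = col_double()
  )
)

# Option 2: let readr guess the types, but silence the message
toy_quiet <- read_csv(toy_csv, show_col_types = FALSE)

print(toy)
```

Declaring the types is a little more work, but it means a surprise in the file (for example text in a numeric column) is flagged when the file is read rather than later in the analysis.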
Printing and Glimpsing your Data
print(frames)
## # A tibble: 4,725 × 8
## id gender age condition sample_size n_obs test_item response
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 male 36 category small 2 1 8
## 2 1 male 36 category small 2 2 7
## 3 1 male 36 category small 2 3 6
## 4 1 male 36 category small 2 4 6
## 5 1 male 36 category small 2 5 5
## 6 1 male 36 category small 2 6 6
## 7 1 male 36 category small 2 7 3
## 8 1 male 36 category medium 6 1 9
## 9 1 male 36 category medium 6 2 7
## 10 1 male 36 category medium 6 3 5
## # ℹ 4,715 more rows
glimpse(frames)
## Rows: 4,725
## Columns: 8
## $ id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "m…
## $ age <dbl> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36…
## $ condition <chr> "category", "category", "category", "category", "category"…
## $ sample_size <chr> "small", "small", "small", "small", "small", "small", "sma…
## $ n_obs <dbl> 2, 2, 2, 2, 2, 2, 2, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, …
## $ test_item <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6…
## $ response <dbl> 8, 7, 6, 6, 5, 6, 3, 9, 7, 5, 6, 4, 4, 2, 8, 7, 6, 6, 4, 1…
Why are Data Summaries important?
#summarise my data
data_summary <- frames %>%
group_by(test_item, condition, sample_size) %>%
summarise(
mean_resp = mean(response),
sd_resp = sd(response)
) %>%
ungroup()
## `summarise()` has grouped output by 'test_item', 'condition'. You can override
## using the `.groups` argument.
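In the above script, we have:
taken “frames” and passed it through each step using the pipe (%>%)
grouped the data by test_item, condition and sample_size using group_by(…)
computed the mean and standard deviation of response for each group using summarise(…)
removed the grouping using ungroup()
stored the result in a variable named “data_summary” by using “data_summary <-”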
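The message that summarise() prints about grouped output is not an error: it is a reminder that the result is still grouped by all but the last grouping variable. As a hedged sketch, the same kind of summary can be written with the `.groups` argument the message mentions. I am using the built-in mtcars data as a stand-in for “frames”, and the name cars_summary is my own.

```r
library(dplyr)

# mtcars is a built-in dataset, used here only as a stand-in for "frames"
cars_summary <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    .groups = "drop" # return an ungrouped tibble: no message, no ungroup() needed
  )

print(cars_summary)
```

Either approach ends with an ungrouped tibble; `.groups = "drop"` simply does in one step what summarise() followed by ungroup() does in two.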
Writing the summary data to a new file
#write summary to file
write_csv(data_summary, file = "data_summary.csv")
#print the summary
print(data_summary)
## # A tibble: 42 × 5
## test_item condition sample_size mean_resp sd_resp
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 category large 7.60 2.36
## 2 1 category medium 7.32 2.49
## 3 1 category small 6.07 2.82
## 4 1 property large 7.16 2.23
## 5 1 property medium 6.66 2.40
## 6 1 property small 5.78 2.57
## 7 2 category large 7.51 2.01
## 8 2 category medium 7.17 1.99
## 9 2 category small 6.26 2.28
## 10 2 property large 7.20 1.84
## # ℹ 32 more rows
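In the above script, we have:
written “data_summary” to a new file named data_summary.csv using write_csv(…)
printed “data_summary”, which has one row for each of the 42 groups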
What is ggplot?
#load packages
library(tidyverse)
#visualise mpg data in a scatterplot
picture <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(
      x = displ,
      y = hwy,
      color = cyl
      # color = factor(cyl)
    ),
    # color = "purple"
    size = 4
  ) +
  geom_smooth(
    mapping = aes(
      x = displ,
      y = hwy
    )
  ) +
  geom_rug(
    mapping = aes(
      x = displ,
      y = hwy
    )
  )
#print the ggplot object
print(picture)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In the above script, we have:
loaded the tidyverse package
used the ggplot(…) function to prep data for visualisation (BASE LAYER)
used the geom_point(mapping = …) function to visualise mpg data
in a scatterplot (FIRST LAYER)
used the aes(…) function to provide the scatterplot aesthetics
used the geom_smooth(mapping = …) function to add a smoothed trend
line through the data (SECOND LAYER)
used the geom_rug(mapping = …) function to add lines to the axes
to further visualise each datapoint (THIRD LAYER)
stored all this in a variable named “picture” by using
“picture <-”
printed “picture” (i.e. printed the scatterplot)
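The commented-out lines in the script (color = factor(cyl) and color = "purple") hint at three different ways of colouring the points. Here is a minimal sketch of the difference, as I understand it; the plot names p_continuous, p_discrete and p_fixed are my own.

```r
library(ggplot2)

# cyl is numeric, so mapping it directly gives a continuous colour gradient
p_continuous <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = cyl), size = 4)

# factor(cyl) treats each cylinder count as a category: one distinct colour each
p_discrete <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = factor(cyl)), size = 4)

# Outside aes(), color is a fixed setting: every point is purple
p_fixed <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "purple", size = 4)

print(p_discrete)
```

The rule of thumb: anything inside aes() is mapped to a variable in the data, while anything outside aes() (like size = 4) is a fixed value applied to the whole layer.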
Global and Local Mappings
#visualise mpg data in a scatterplot
global_picture <- ggplot(data = mpg,
mapping = aes(x = displ, y = hwy)
) +
geom_point(mapping = aes(color = factor(cyl)), size = 4) +
geom_smooth() +
geom_rug()
#print the ggplot object
print(global_picture)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
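In the script above, x and y are set once inside ggplot(), so every layer inherits them (a global mapping), while the colour mapping sits inside geom_point() and only applies to the points (a local mapping). The sketch below shows why the distinction matters. Note that I have swapped in method = "lm" for the smoother, purely to keep the example quiet and quick; that choice and the plot names are my own assumptions, not part of the original script.

```r
library(ggplot2)

# Local colour mapping: only the points are coloured, one smooth line overall
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x)

# Global colour mapping: every layer inherits it, so geom_smooth()
# draws a separate line for each cylinder group
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

print(p_local)
print(p_global)
```

Putting shared aesthetics in ggplot() saves repetition, and keeping layer-specific ones local stops them leaking into layers where they are not wanted.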
#read the data_forensic csv
data_forensic <- read_csv(file = "C:\\Users\\Alyss\\Downloads\\data_forensic.csv")
## Rows: 5700 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): handwriting_expert, us, condition, forensic_scientist, forensic_spe...
## dbl (7): participant, age, handwriting_reports, confidence, familiarity, est...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#visualise data_forensic in a box and whisker plot
plot_1 <- ggplot(data_forensic) + geom_boxplot(aes(band, est))
#draw plots
print(plot_1)
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
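The warning above means that 4 rows had an est value that could not be plotted (for example a missing value, NA), so ggplot left them out. As a hedged sketch, this is how I would check for and remove such rows before plotting. The toy data and its band labels are made up for illustration and are not the real data_forensic values.

```r
library(dplyr)

# Toy stand-in for data_forensic, with one missing estimate
toy_forensic <- tibble(
  band = c("low", "low", "high", "high"),
  est = c(10, NA, 80, 90)
)

# Count the missing values in the plotted variable
n_missing <- sum(is.na(toy_forensic$est))
print(n_missing)

# Drop those rows explicitly, so the plot no longer has to
toy_complete <- toy_forensic %>%
  filter(!is.na(est))

print(toy_complete)
```

Dropping the rows yourself makes the decision visible in the script, instead of leaving it to a warning that is easy to miss.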
#visualise data_forensic in a violin plot
plot_2 <- ggplot(data_forensic) + geom_violin(aes(band, est))
#draw plots
print(plot_2)
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
#visualise data_forensic in a column plot
plot_3 <- ggplot(data_forensic) + geom_col(aes(band, est))
#draw plots
print(plot_3)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_col()`).
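One thing to be careful of with the column plot: geom_col() draws one bar segment per row and stacks them, so with raw data the height of each column is the sum of all est values in that band, not the average. If the mean per band is what you want to show, a minimal sketch (with made-up toy data and my own variable names) is to summarise first and plot the summary:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for data_forensic (values are made up)
toy_forensic <- tibble(
  band = c("low", "low", "high", "high"),
  est = c(10, 20, 80, 90)
)

# geom_col() on raw rows stacks them: the "low" column reaches 10 + 20 = 30
p_stacked <- ggplot(toy_forensic) + geom_col(aes(band, est))

# Summarise first if the column height should be the mean per band
band_means <- toy_forensic %>%
  group_by(band) %>%
  summarise(mean_est = mean(est), .groups = "drop")

p_means <- ggplot(band_means) + geom_col(aes(band, mean_est))

print(p_means)
```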
#create facets
by_expertise <- ggplot(data_forensic) +
geom_boxplot(aes(band, est)) +
facet_wrap(vars(handwriting_expert))
#print facets
print(by_expertise)
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
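In the above script, we have:
used the facet_wrap(vars(…)) function to split the box plot into separate panels (facets), one for each value of handwriting_expert
stored the faceted plot in a variable named “by_expertise” and printed it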
The below are just some useful methods to clean up and beautify your
plots. We will use the above facets as the plot to beautify.
Aesthetics
Parameters
pic <- ggplot(
data = data_forensic
) +
geom_boxplot(
mapping = aes(
x = band,
y = est,
fill = band
)
) +
facet_wrap(
vars(handwriting_expert)
) +
theme_minimal() +
scale_x_discrete(
name = NULL, #(name refers to the axis title)
labels = NULL #(labels refer to the axis labels)
) +
scale_y_continuous( #(est is numeric, so the y axis needs a continuous scale)
name = "Estimated Probability"
) +
ggtitle(
label = "Handwriting estimates for experts and novices",
subtitle = "Source: Martire et al."
) +
scale_fill_viridis_d(
alpha = .5,
name = NULL
)
print(pic)
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
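As an alternative way of setting the titles, here is a hedged sketch using labs(), which collects the plot title, subtitle, axis titles and legend title in one call. I am using the mpg data as a stand-in for data_forensic, so the labels and the name p_labelled are my own.

```r
library(ggplot2)

# mpg stands in for data_forensic here
p_labelled <- ggplot(mpg) +
  geom_boxplot(aes(x = class, y = hwy, fill = class)) +
  theme_minimal() +
  labs(
    title = "Highway mileage by vehicle class",
    subtitle = "Source: ggplot2 mpg data",
    x = NULL,                      # remove the x-axis title
    y = "Highway miles per gallon",
    fill = NULL                    # remove the legend title
  )

print(p_labelled)
```

labs() only changes titles. Removing the axis tick labels themselves still needs something like scale_x_discrete(labels = NULL), as in the script above.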
Data wrangling is the process of cleaning up your raw data so
that it is in an analysable and interpretable format. Think turning raw
keypress data into accuracy data. In other words, it is the
preprocessing of data before data is analysed, and involves cleaning and
filtering your data, and computing important variables.
dplyr is a package within tidyverse which we
will use for data wrangling.
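To make the keypress example concrete, here is a small sketch of turning raw keypresses into accuracy scores with dplyr. The data and the column names (key_pressed, correct_key) are entirely hypothetical; they are only there to show the shape of the steps.

```r
library(dplyr)

# Hypothetical raw keypress data: which key was pressed and which was correct
keypresses <- tibble(
  participant = c(1, 1, 1, 2, 2, 2),
  key_pressed = c("f", "j", "f", "j", "j", "f"),
  correct_key = c("f", "j", "j", "j", "f", "f")
)

accuracy <- keypresses %>%
  mutate(correct = key_pressed == correct_key) %>% # compute a new variable
  group_by(participant) %>%
  summarise(accuracy = mean(correct), .groups = "drop") # proportion correct

print(accuracy)
```

The raw data has one row per keypress, while the wrangled data has one row per participant, which is the format you would actually analyse.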
#Import the SWOW data
library(tidyverse)
swow <- read_tsv(file = "C:\\Users\\Alyss\\Downloads\\data_swow.csv.zip")
## Multiple files in zip: reading 'swow.csv'
## Rows: 483636 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): cue, response
## dbl (3): R1, N, R1.Strength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
swow <- swow %>% mutate(id = 1:n())
print(swow)
## # A tibble: 483,636 × 6
## cue response R1 N R1.Strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 a one 21 97 0.216 1
## 2 a the 16 97 0.165 2
## 3 a b 9 97 0.0928 3
## 4 a an 4 97 0.0412 4
## 5 a first 3 97 0.0309 5
## 6 a letter 3 97 0.0309 6
## 7 a alphabet 2 97 0.0206 7
## 8 a apple 2 97 0.0206 8
## 9 a article 2 97 0.0206 9
## 10 a bat 2 97 0.0206 10
## # ℹ 483,626 more rows
#manual variable name cleaning: new_name = old_name (names are case sensitive)
swow <- swow %>%
rename(n_response = R1,
n_total = N,
strength = R1.Strength)
print(swow)
## # A tibble: 483,636 × 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 a one 21 97 0.216 1
## 2 a the 16 97 0.165 2
## 3 a b 9 97 0.0928 3
## 4 a an 4 97 0.0412 4
## 5 a first 3 97 0.0309 5
## 6 a letter 3 97 0.0309 6
## 7 a alphabet 2 97 0.0206 7
## 8 a apple 2 97 0.0206 8
## 9 a article 2 97 0.0206 9
## 10 a bat 2 97 0.0206 10
## # ℹ 483,626 more rows
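In the above script, we have:
imported the SWOW data using read_tsv(…), because the values in the file are separated by tabs
added an “id” variable that numbers each row using mutate(id = 1:n())
renamed R1, N and R1.Strength to n_response, n_total and strength using rename(new name = old name)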
#filtering for response = woman, response was given by more than one person
woman_bck <- swow %>%
filter(response == "woman", n_response >1) %>%
arrange(desc(strength)) %>% #decreasing strength
select(cue, response, strength, id)
#alternatively, you can drop the irrelevant variables instead:
#select(-n_response, -n_total)
#select(-starts_with("n_"))
print(woman_bck)
## # A tibble: 200 × 4
## cue response strength id
## <chr> <chr> <dbl> <int>
## 1 man woman 0.576 258593
## 2 lady woman 0.36 240149
## 3 feminist woman 0.303 158641
## 4 female woman 0.232 158492
## 5 pregnant woman 0.18 327286
## 6 housewife woman 0.17 209394
## 7 vagina woman 0.17 459047
## 8 dame woman 0.167 105474
## 9 menopause woman 0.16 266238
## 10 uterus woman 0.16 458533
## # ℹ 190 more rows
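In the above script, we have:
kept only the rows where the response is “woman” and was given by more than one person using filter(…)
sorted the rows from strongest to weakest association using arrange(desc(strength))
kept only the cue, response, strength and id variables using select(…)
stored the result in a variable named “woman_bck”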
#Computing a new "rank" variable which ranks "strength"
#Also creating new values for data that we already have (but just making the table cleaner and more readable)
woman_bck <- swow %>%
filter(response == "woman", n_response >1) %>%
arrange(desc(strength)) %>% #decreasing strength
select(-starts_with("n_")) %>%
mutate(rank = rank(-strength),
type = "backward",
word = response,
associate = cue
)
print(woman_bck)
## # A tibble: 200 × 8
## cue response strength id rank type word associate
## <chr> <chr> <dbl> <int> <dbl> <chr> <chr> <chr>
## 1 man woman 0.576 258593 1 backward woman man
## 2 lady woman 0.36 240149 2 backward woman lady
## 3 feminist woman 0.303 158641 3 backward woman feminist
## 4 female woman 0.232 158492 4 backward woman female
## 5 pregnant woman 0.18 327286 5 backward woman pregnant
## 6 housewife woman 0.17 209394 6.5 backward woman housewife
## 7 vagina woman 0.17 459047 6.5 backward woman vagina
## 8 dame woman 0.167 105474 8 backward woman dame
## 9 menopause woman 0.16 266238 9.5 backward woman menopause
## 10 uterus woman 0.16 458533 9.5 backward woman uterus
## # ℹ 190 more rows
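In the above script, we have:
repeated the filter and arrange steps from before, and dropped the variables starting with “n_” using select(…)
used mutate(…) to compute a new “rank” variable, where rank 1 is the strongest association
used mutate(…) to add “type”, “word” and “associate” variables, which relabel information we already have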
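Notice the ranks of 6.5 and 9.5 in the output: when two cues have the same strength, rank() gives them the average of the ranks they would have occupied. dplyr also offers min_rank() and dense_rank(), which handle ties differently. Here is a minimal sketch on made-up data that mirrors the 0.17 tie; the cue labels and column names are my own.

```r
library(dplyr)

# Toy strengths with a tie, mirroring the 0.17 tie in the output above
toy <- tibble(
  cue = c("cue_a", "cue_b", "cue_c", "cue_d"),
  strength = c(0.18, 0.17, 0.17, 0.167)
)

ranked <- toy %>%
  mutate(
    rank_average = rank(-strength),          # ties share the average rank: 1, 2.5, 2.5, 4
    rank_min = min_rank(desc(strength)),     # ties share the lowest rank:  1, 2, 2, 4
    rank_dense = dense_rank(desc(strength))  # as min_rank, but no gaps:    1, 2, 2, 3
  )

print(ranked)
```

Which one to use depends on how you want to report ties; the averaged ranks from rank() are what the script above produces.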