Create a new code chunk where you load the tidyverse package. In the chunk settings, suppress any output messages.
library(tidyverse)
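One way to suppress the startup messages is the `message` chunk option. The sketch below uses the `#|` comment syntax for chunk options, which assumes knitr 1.35 or later; the same option can instead go in the chunk header as `{r, message=FALSE}`.

```r
#| message: false
# The chunk option above (knitr >= 1.35 assumed) hides package startup
# messages in the knitted document; the code itself still runs normally.
library(tidyverse)
```

Outside of knitr the `#|` line is just an ordinary comment, so the chunk also runs unchanged in the console.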
The tibble df has 60 observations (rows) of four variables (columns): group, gender, score1, and score2. score1 and score2 are continuous scores from two tests. Each row represents one participant.
df
## # A tibble: 60 × 4
## group gender score1 score2
## <int> <chr> <dbl> <chr>
## 1 2 F 18.7 14.7563711082321
## 2 1 M 20.1 15.1463059324341
## 3 2 F 17.4 19.0025387614538
## 4 1 M 18.7 15.5693261509451
## 5 2 F 18.5 16.7322250273729
## 6 1 999 16.9 16.4511010915052
## 7 2 M 20.4 15.1008590050657
## 8 1 F 20.3 15.191041952879
## 9 1 F 19.4 13.9717194882152
## 10 2 M 21.2 22.6918520246433
## # ℹ 50 more rows
There is something to fix in three of the variables. Explore the data and describe what needs to be corrected.
Hint: You can use e.g. str(), distinct(), and summary() to explore the data.
str(df)
## tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
## $ group : int [1:60] 2 1 2 1 2 1 2 1 1 2 ...
## $ gender: chr [1:60] "F" "M" "F" "M" ...
## $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
## $ score2: chr [1:60] "14.7563711082321" "15.1463059324341" "19.0025387614538" "15.5693261509451" ...
distinct(df)
## # A tibble: 60 × 4
## group gender score1 score2
## <int> <chr> <dbl> <chr>
## 1 2 F 18.7 14.7563711082321
## 2 1 M 20.1 15.1463059324341
## 3 2 F 17.4 19.0025387614538
## 4 1 M 18.7 15.5693261509451
## 5 2 F 18.5 16.7322250273729
## 6 1 999 16.9 16.4511010915052
## 7 2 M 20.4 15.1008590050657
## 8 1 F 20.3 15.191041952879
## 9 1 F 19.4 13.9717194882152
## 10 2 M 21.2 22.6918520246433
## # ℹ 50 more rows
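Called on the whole tibble, distinct() only drops duplicated rows, so it returns all 60 rows here and reveals little. Restricting it to one column lists that column's unique values, which is where a code such as 999 stands out. A minimal sketch on a small made-up tibble (`toy` and its values are illustrative, not the exercise data):

```r
library(dplyr)

# Hypothetical stand-in for df, with one suspicious gender code
toy <- tibble(
  group  = c(1L, 2L, 1L, 2L),
  gender = c("F", "M", "999", "F"),
  score1 = c(18.7, 20.1, 16.9, 17.4)
)

# Unique values of a single column
gender_values <- distinct(toy, gender)
gender_values

# Frequency of each value, including unexpected codes
gender_counts <- count(toy, gender)
gender_counts
```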
summary(df)
##      group         gender              score1          score2
##  Min.   :1.0   Length:60          Min.   :14.17   Length:60
##  1st Qu.:1.0   Class :character   1st Qu.:16.85   Class :character
##  Median :1.5   Mode  :character   Median :17.61   Mode  :character
##  Mean   :1.5                      Mean   :17.89
##  3rd Qu.:2.0                      3rd Qu.:19.01
##  Max.   :2.0                      Max.   :21.53
gender should be a factor rather than a character variable (its valid values are “M” and “F”). It also contains the value 999 (see row 6), which appears to be a code for missing data and should be recoded as NA.
group is stored as an integer, but 1 and 2 are group labels, so it should be a factor.
score2 should be numeric but is stored as character.
df$score2 <- as.numeric(df$score2)
df$group <- as.factor(df$group)
summary(df)
##  group     gender              score1          score2
##  1:30   Length:60          Min.   :14.17   Min.   :11.28
##  2:30   Class :character   1st Qu.:16.85   1st Qu.:14.44
##         Mode  :character   Median :17.61   Median :15.47
##                            Mean   :17.89   Mean   :16.08
##                            3rd Qu.:19.01   3rd Qu.:17.82
##                            Max.   :21.53   Max.   :22.69
str(df)
## tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
## $ group : Factor w/ 2 levels "1","2": 2 1 2 1 2 1 2 1 1 2 ...
## $ gender: chr [1:60] "F" "M" "F" "M" ...
## $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
## $ score2: num [1:60] 14.8 15.1 19 15.6 16.7 ...
Make the corrections you described above.
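# Convert score2 from character to numeric
df$score2 <- as.numeric(df$score2)
# Treat the group labels as a factor
df$group <- as.factor(df$group)
# Recode the missing-value code "999" as NA, then convert gender to a factor
df$gender[df$gender == "999"] <- NA
df$gender <- as.factor(df$gender)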
summary(df)
##  group   gender       score1          score2
##  1:30   F   :28   Min.   :14.17   Min.   :11.28
##  2:30   M   :28   1st Qu.:16.85   1st Qu.:14.44
##         NA's: 4   Median :17.61   Median :15.47
##                   Mean   :17.89   Mean   :16.08
##                   3rd Qu.:19.01   3rd Qu.:17.82
##                   Max.   :21.53   Max.   :22.69
str(df)
## tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
## $ group : Factor w/ 2 levels "1","2": 2 1 2 1 2 1 2 1 1 2 ...
## $ gender: Factor w/ 2 levels "F","M": 1 2 1 2 1 NA 2 1 1 2 ...
## $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
## $ score2: num [1:60] 14.8 15.1 19 15.6 16.7 ...
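The same corrections can also be written as a single mutate() call. This is a sketch of an alternative style, not the required solution; it assumes dplyr's na_if() for the recoding and uses a made-up tibble `toy` rather than the exercise data.

```r
library(dplyr)

# Hypothetical stand-in for df with the same three problems
toy <- tibble(
  group  = c(2L, 1L, 1L),
  gender = c("F", "M", "999"),
  score2 = c("14.75", "15.14", "16.45")
)

toy_clean <- toy %>%
  mutate(
    score2 = as.numeric(score2),           # character -> numeric
    group  = factor(group),                # group labels -> factor
    gender = factor(na_if(gender, "999"))  # "999" -> NA, then factor
  )

str(toy_clean)
```

Recoding before converting to a factor keeps "999" from becoming a factor level of its own.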
Count observations by group and gender. Arrange by the number of observations (ascending).
df %>%
  count(group, gender) %>%
  arrange(n)
## # A tibble: 6 × 3
## group gender n
## <fct> <fct> <int>
## 1 2 <NA> 1
## 2 1 <NA> 3
## 3 1 F 13
## 4 1 M 14
## 5 2 M 14
## 6 2 F 15
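count() also has a `sort` argument, but `sort = TRUE` orders the result from largest to smallest, so an explicit arrange(n) is still needed for ascending order. A small illustration on made-up data (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical data: three group/gender cells with 2, 1, and 3 rows
toy <- tibble(
  group  = factor(c(1, 1, 1, 2, 2, 2)),
  gender = factor(c("F", "F", "M", "M", "M", "M"))
)

# Descending order: largest cell first
desc_counts <- toy %>% count(group, gender, sort = TRUE)

# Ascending order, as asked for in the exercise
asc_counts <- toy %>% count(group, gender) %>% arrange(n)

desc_counts
asc_counts
```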
Create a new variable, score_diff, that contains the difference between score1 and score2.
df <- df %>%
  mutate(score_diff = score1 - score2)
Compute the means of score1, score2, and score_diff.
Hint: Like mutate(), summarise() can take multiple variables in one go.
df %>%
  summarise(
    mean_score1 = mean(score1, na.rm = TRUE),
    mean_score2 = mean(score2, na.rm = TRUE),
    mean_score_diff = mean(score_diff, na.rm = TRUE)
  )
## # A tibble: 1 × 3
## mean_score1 mean_score2 mean_score_diff
## <dbl> <dbl> <dbl>
## 1 17.9 16.1 1.82
Compute the means of score1, score2, and score_diff by gender.
df %>%
  group_by(gender) %>%
  summarise(
    mean_score1 = mean(score1, na.rm = TRUE),
    mean_score2 = mean(score2, na.rm = TRUE),
    mean_score_diff = mean(score_diff, na.rm = TRUE)
  )
## # A tibble: 3 × 4
## gender mean_score1 mean_score2 mean_score_diff
## <fct> <dbl> <dbl> <dbl>
## 1 F 17.9 16.3 1.63
## 2 M 18.1 16.0 2.08
## 3 <NA> 16.4 15.0 1.34
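Because the same function is applied to three columns, the summaries can be written more compactly with across(). A sketch on made-up data; the `.names` pattern is chosen here only to reproduce the column names used in this exercise.

```r
library(dplyr)

# Hypothetical stand-in for df, including one missing score
toy <- tibble(
  gender = factor(c("F", "F", "M", "M")),
  score1 = c(18, 20, 17, 19),
  score2 = c(15, 17, 16, NA)
) %>%
  mutate(score_diff = score1 - score2)

means_by_gender <- toy %>%
  group_by(gender) %>%
  summarise(
    across(
      c(score1, score2, score_diff),
      ~ mean(.x, na.rm = TRUE),  # same function for every selected column
      .names = "mean_{.col}"     # e.g. score1 -> mean_score1
    )
  )

means_by_gender
```

Dropping the group_by() line gives the overall means in the same way.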
Using ggplot2, create a scatter plot with score1 on the x-axis and score2 on the y-axis.
ggplot(df, aes(x = score1, y = score2)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Score1 vs Score2",
    x = "Score 1",
    y = "Score 2"
  ) +
  theme_minimal()
Continuing with the previous plot, colour the points based on gender.
Set the output figure width to 10 and height to 6.
ggplot(df, aes(x = score1, y = score2, color = gender)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Score1 vs Score2 by Gender",
    x = "Score 1",
    y = "Score 2",
    color = "Gender"
  ) +
  theme_minimal()
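The figure size is set through chunk options rather than in the ggplot2 code. A sketch using `#|` option comments (knitr 1.35 or later assumed; the equivalent chunk header is `{r, fig.width=10, fig.height=6}`), with a small made-up data frame standing in for df:

```r
#| fig.width: 10
#| fig.height: 6
library(ggplot2)

# Hypothetical stand-in for df
toy <- data.frame(
  score1 = c(18.7, 20.1, 17.4, 16.9),
  score2 = c(14.8, 15.1, 19.0, 16.5),
  gender = factor(c("F", "M", "F", NA))
)

# Same mapping as in the exercise: colour the points by gender
p <- ggplot(toy, aes(x = score1, y = score2, colour = gender)) +
  geom_point()
p
```

The width and height are given in inches and only take effect when the document is knitted.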
Add the author (your name) and date into the metadata section. Create a table of contents.
Knit your document to HTML by changing html_notebook to html_document in the metadata, and pressing Knit.
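Knitting can also be started from the console with rmarkdown::render(), which is roughly what the Knit button does. The sketch below requests HTML output with a table of contents through html_document(toc = TRUE), which corresponds to setting `toc: true` under `html_document` in the metadata. The file name `exercise.Rmd` is a placeholder, not a file from this exercise.

```r
library(rmarkdown)

# "exercise.Rmd" is a hypothetical file name; replace it with your own file
input_file <- "exercise.Rmd"

if (file.exists(input_file)) {
  render(
    input_file,
    output_format = html_document(toc = TRUE)  # HTML output with a table of contents
  )
}
```

The author and date still have to be added to the metadata section of the file itself.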