Exercise 1

1.1

Create a new code chunk where you load the tidyverse package. In the chunk settings, suppress any output messages.

library(tidyverse)
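The chunk setting itself is not visible in the knitted output. A minimal sketch of what the source chunk might look like, assuming the standard knitr option message = FALSE is what was used; the header is shown as a comment because it sits on the chunk's opening line, not in the R code:

```r
# Assumed chunk header in the .Rmd source (not shown in the knitted document):
#   {r, message = FALSE}
# message = FALSE keeps the package start-up messages out of the output.
library(tidyverse)
```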

1.2

The tibble df has 60 observations (rows) of four variables (columns): group, gender, score1, and score2 (score1 and score2 are continuous scores from two tests). Each row represents one participant.

df

## # A tibble: 60 × 4
##    group gender score1 score2          
##    <int> <chr>   <dbl> <chr>           
##  1     2 F        18.7 14.7563711082321
##  2     1 M        20.1 15.1463059324341
##  3     2 F        17.4 19.0025387614538
##  4     1 M        18.7 15.5693261509451
##  5     2 F        18.5 16.7322250273729
##  6     1 999      16.9 16.4511010915052
##  7     2 M        20.4 15.1008590050657
##  8     1 F        20.3 15.191041952879 
##  9     1 F        19.4 13.9717194882152
## 10     2 M        21.2 22.6918520246433
## # ℹ 50 more rows

There is something to fix in three of the variables. Explore the data and describe what needs to be corrected.

Hint: You can use e.g. str(), distinct(), and summary() to explore the data.

str(df)

## tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
##  $ group : int [1:60] 2 1 2 1 2 1 2 1 1 2 ...
##  $ gender: chr [1:60] "F" "M" "F" "M" ...
##  $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
##  $ score2: chr [1:60] "14.7563711082321" "15.1463059324341" "19.0025387614538" "15.5693261509451" ...

distinct(df)

## # A tibble: 60 × 4
##    group gender score1 score2          
##    <int> <chr>   <dbl> <chr>           
##  1     2 F        18.7 14.7563711082321
##  2     1 M        20.1 15.1463059324341
##  3     2 F        17.4 19.0025387614538
##  4     1 M        18.7 15.5693261509451
##  5     2 F        18.5 16.7322250273729
##  6     1 999      16.9 16.4511010915052
##  7     2 M        20.4 15.1008590050657
##  8     1 F        20.3 15.191041952879 
##  9     1 F        19.4 13.9717194882152
## 10     2 M        21.2 22.6918520246433
## # ℹ 50 more rows

summary(df)

##      group        gender              score1         score2         
##  Min.   :1.0   Length:60          Min.   :14.17   Length:60         
##  1st Qu.:1.0   Class :character   1st Qu.:16.85   Class :character  
##  Median :1.5   Mode  :character   Median :17.61   Mode  :character  
##  Mean   :1.5                      Mean   :17.89                     
##  3rd Qu.:2.0                      3rd Qu.:19.01                     
##  Max.   :2.0                      Max.   :21.53

gender is stored as character and should be a factor; besides “F” and “M” it contains the value “999” (e.g. row 6), which looks like a missing-value code and should be recoded to NA. group is stored as integer, but 1 and 2 are group labels, so it should be a factor. score2 holds numeric values but is stored as character, so it should be converted to numeric.
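Calling distinct() on the whole tibble returns every row, so it helps to inspect one column at a time. A minimal sketch on a small stand-in tibble (toy and its values are invented for illustration and are not the exercise data):

```r
library(dplyr)

# Hypothetical stand-in for df; values are made up
toy <- tibble(
  group  = c(2L, 1L, 2L, 1L),
  gender = c("F", "M", "F", "999"),
  score1 = c(18.7, 20.1, 17.4, 16.9),
  score2 = c("14.8", "15.1", "19.0", "16.5")
)

# Unique values of a single column; the stray "999" shows up here
gender_values <- distinct(toy, gender)
gender_values

# How often each value occurs, including the suspicious code
gender_counts <- count(toy, gender)
gender_counts

# Column types at a glance: score2 is character although it holds numbers
col_types <- sapply(toy, class)
col_types
```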

df$score2 <- as.numeric(df$score2)
df$group <- as.factor(df$group)
summary(df)

##  group     gender              score1          score2     
##  1:30   Length:60          Min.   :14.17   Min.   :11.28  
##  2:30   Class :character   1st Qu.:16.85   1st Qu.:14.44  
##         Mode  :character   Median :17.61   Median :15.47  
##                            Mean   :17.89   Mean   :16.08  
##                            3rd Qu.:19.01   3rd Qu.:17.82  
##                            Max.   :21.53   Max.   :22.69

str(df)

## tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
##  $ group : Factor w/ 2 levels "1","2": 2 1 2 1 2 1 2 1 1 2 ...
##  $ gender: chr [1:60] "F" "M" "F" "M" ...
##  $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
##  $ score2: num [1:60] 14.8 15.1 19 15.6 16.7 ...

Exercise 2

2.1

Make the corrections you described above.

df$score2 <- as.numeric(df$score2)
df$group <- as.factor(df$group)
df$gender[df$gender == "999"] <- NA
df$gender <- as.factor(df$gender)
summary(df)

##  group   gender       score1          score2     
##  1:30   F   :28   Min.   :14.17   Min.   :11.28  
##  2:30   M   :28   1st Qu.:16.85   1st Qu.:14.44  
##         NA's: 4   Median :17.61   Median :15.47  
##                   Mean   :17.89   Mean   :16.08  
##                   3rd Qu.:19.01   3rd Qu.:17.82  
##                   Max.   :21.53   Max.   :22.69

str(df)

## tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
##  $ group : Factor w/ 2 levels "1","2": 2 1 2 1 2 1 2 1 1 2 ...
##  $ gender: Factor w/ 2 levels "F","M": 1 2 1 2 1 NA 2 1 1 2 ...
##  $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
##  $ score2: num [1:60] 14.8 15.1 19 15.6 16.7 ...
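The same corrections can also be written as a single mutate() call. This is only a sketch of an alternative, shown on an invented stand-in tibble (toy is hypothetical); na_if() is the dplyr helper that turns a given value into NA:

```r
library(dplyr)

# Hypothetical stand-in for df; values are made up
toy <- tibble(
  group  = c(2L, 1L, 2L, 1L),
  gender = c("F", "M", "F", "999"),
  score1 = c(18.7, 20.1, 17.4, 16.9),
  score2 = c("14.8", "15.1", "19.0", "16.5")
)

toy_clean <- toy %>%
  mutate(
    score2 = as.numeric(score2),           # character -> numeric
    group  = factor(group),                # integer labels -> factor
    gender = factor(na_if(gender, "999"))  # "999" -> NA, then factor
  )

str(toy_clean)
```

Because the result is assigned to a new object, the original tibble stays available for comparison.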

2.2

Count observations by group and gender. Arrange by the number of observations (ascending).

df %>%
  count(group, gender) %>%
  arrange(n)

## # A tibble: 6 × 3
##   group gender     n
##   <fct> <fct>  <int>
## 1 2     <NA>       1
## 2 1     <NA>       3
## 3 1     F         13
## 4 1     M         14
## 5 2     M         14
## 6 2     F         15

Exercise 3

3.1

Create a new variable, score_diff, that contains the difference between score1 and score2.

df <- df %>%
  mutate(score_diff = score1 - score2)

3.2

Compute the means of score1, score2, and score_diff.

Hint: Like mutate(), summarise() can take multiple variables in one go.

df %>%
  summarise(
    mean_score1 = mean(score1, na.rm = TRUE),
    mean_score2 = mean(score2, na.rm = TRUE),
    mean_score_diff = mean(score_diff, na.rm = TRUE)
  )

## # A tibble: 1 × 3
##   mean_score1 mean_score2 mean_score_diff
##         <dbl>       <dbl>           <dbl>
## 1        17.9        16.1            1.82
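When the same summary function is applied to several columns, across() avoids repeating the call. A minimal sketch on invented data (toy is hypothetical); the .names argument reproduces the mean_ prefix used above:

```r
library(dplyr)

# Hypothetical stand-in for df; values are made up
toy <- tibble(
  gender = factor(c("F", "M", "F", "M")),
  score1 = c(18, 20, 17, 19),
  score2 = c(15, 16, 19, 14)
) %>%
  mutate(score_diff = score1 - score2)

# Means of all three columns in one across() call
overall <- toy %>%
  summarise(across(c(score1, score2, score_diff),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean_{.col}"))
overall

# The same call works per group after group_by()
by_gender <- toy %>%
  group_by(gender) %>%
  summarise(across(c(score1, score2, score_diff),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean_{.col}"))
by_gender
```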

3.3

Compute the means of score1, score2, and score_diff by gender.

df %>%
  group_by(gender) %>%
  summarise(
    mean_score1 = mean(score1, na.rm = TRUE),
    mean_score2 = mean(score2, na.rm = TRUE),
    mean_score_diff = mean(score_diff, na.rm = TRUE)
  )

## # A tibble: 3 × 4
##   gender mean_score1 mean_score2 mean_score_diff
##   <fct>        <dbl>       <dbl>           <dbl>
## 1 F             17.9        16.3            1.63
## 2 M             18.1        16.0            2.08
## 3 <NA>          16.4        15.0            1.34

Exercise 4

4.1

Using ggplot2, create a scatter plot with score1 on the x-axis and score2 on the y-axis.

ggplot(df, aes(x = score1, y = score2)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Score1 vs Score2",
    x = "Score 1",
    y = "Score 2"
  ) +
  theme_minimal()

4.2

Continuing with the previous plot, colour the points based on gender.

Set the output figure width to 10 and height to 6.

ggplot(df, aes(x = score1, y = score2, color = gender)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Score1 vs Score2 by Gender",
    x = "Score 1",
    y = "Score 2",
    color = "Gender"
  ) +
  theme_minimal()
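The figure size is controlled by chunk options, which do not appear in the knitted output. A sketch of how this could be set, assuming the standard knitr options fig.width and fig.height (in inches); toy and the output file name are invented for illustration:

```r
library(ggplot2)

# Assumed chunk header in the .Rmd source:
#   {r, fig.width = 10, fig.height = 6}

# Hypothetical stand-in for df; values are made up
toy <- data.frame(
  score1 = c(18.7, 20.1, 17.4, 16.9),
  score2 = c(14.8, 15.1, 19.0, 16.5),
  gender = factor(c("F", "M", "F", "M"))
)

p <- ggplot(toy, aes(x = score1, y = score2, colour = gender)) +
  geom_point()

# Outside a knitted document, ggsave() accepts the same dimensions
# (inches by default)
out_file <- file.path(tempdir(), "scatter_by_gender.png")
ggsave(out_file, plot = p, width = 10, height = 6)
```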


Exercise 5

5.1

Add the author (your name) and date into the metadata section. Create a table of contents.

5.2

Knit your document to HTML by changing html_notebook to html_document in the metadata, and pressing Knit.
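A sketch of what the metadata section might look like after both steps, using the title, author, and date that appear in the knitted output; toc: true is the html_document option that adds a table of contents:

```yaml
---
title: "Week 2 Exercises"
author: "Wendan Qian"
date: "2025-04-06"
output:
  html_document:
    toc: true
---
```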

Week 2 Exercises

Wendan Qian

2025-04-06
