This lab assignment aims to reinforce your understanding of data
cleaning and descriptive analysis using the dplyr and
psych packages in R. You will apply these concepts through
practical exercises, focusing on chaining dplyr
functions with the %>% operator.
Complete each exercise by writing the necessary R code.
Ensure you use the %>% operator to chain multiple
dplyr functions together.
Interpret the results for each exercise.
Knit your R Markdown file to a PDF and submit it as per the submission instructions.
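As a reminder of what chaining with %>% looks like, here is a minimal sketch. The `scores` data frame and its columns are hypothetical and are not part of the assignment data; the only assumption is that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate chaining
scores <- data.frame(
  id = 1:5,
  score = c(72, 88, NA, 95, 60)
)

# Each step receives the result of the previous step as its first argument
result <- scores %>%
  filter(!is.na(score)) %>%        # drop rows with a missing score
  mutate(passed = score >= 70) %>% # add a logical pass/fail column
  arrange(desc(score))             # sort from highest to lowest score

print(result)
```

Reading the chain top to bottom gives the order in which the operations are applied, which is usually easier to follow than nesting the same calls inside one another.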
dplyr
Clean a dataset using various dplyr functions.
data <- data.frame(
  participant_id = 1:10,
  reaction_time = c(250, 340, 295, NA, 310, 275, 325, 290, 360, NA),
  gender = c("M", "F", "F", "M", "M", "F", "M", "F", "M", "F"),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, NA)
)
print(data)
##    participant_id reaction_time gender accuracy
## 1               1           250      M       95
## 2               2           340      F       87
## 3               3           295      F       92
## 4               4            NA      M       88
## 5               5           310      M       94
## 6               6           275      F       91
## 7               7           325      M       85
## 8               8           290      F       89
## 9               9           360      M       93
## 10             10            NA      F       NA
Clean the dataset by performing the following steps:
Remove rows with missing values.
Rename the reaction_time column to
response_time.
Create a new column performance_group based on
accuracy (High if accuracy >= 90, otherwise
Low).
Remove outliers from the response_time
column.
Relevel the performance_group column to set “Low” as
the reference level.
# Install the dplyr package (if not already installed), then load it
if (!require(dplyr)) {
  install.packages("dplyr", dependencies = TRUE)
  library(dplyr)
}

# Drop rows whose value in `column` lies outside 1.5 * IQR of the quartiles
remove_outliers <- function(data, column) {
  # Calculate quartiles and IQR using tidy evaluation
  Q1 <- quantile(pull(data, {{ column }}), 0.25, na.rm = TRUE)
  Q3 <- quantile(pull(data, {{ column }}), 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val
  # Filter rows based on the calculated bounds
  data %>%
    filter({{ column }} >= lower_bound,
           {{ column }} <= upper_bound)
}

# Create cleaned_data
cleaned_data <- data %>%
  na.omit() %>%
  rename(response_time = reaction_time) %>%
  mutate(performance_group = ifelse(accuracy >= 90, "High", "Low")) %>%
  remove_outliers(response_time) %>%
  mutate(performance_group = relevel(factor(performance_group), ref = "Low"))
print(cleaned_data)
##   participant_id response_time gender accuracy performance_group
## 1              1           250      M       95              High
## 2              2           340      F       87               Low
## 3              3           295      F       92              High
## 4              5           310      M       94              High
## 5              6           275      F       91              High
## 6              7           325      M       85               Low
## 7              8           290      F       89               Low
## 8              9           360      M       93              High
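Interpretation: Two rows (participants 4 and 10) were removed because they contained missing values. The reaction_time column was renamed to response_time. A new column, performance_group, was added; it classifies each participant's accuracy as High (accuracy >= 90) or Low. No outliers were found in response_time, so all eight remaining rows were kept. Finally, performance_group was converted to a factor with "Low" as the reference level.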
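The finding that no outliers were present can be checked by computing the 1.5 * IQR bounds directly. The sketch below uses base R only and copies the eight response times from the cleaned output; the variable names are illustrative.

```r
# Response times of the eight complete rows, copied from the cleaned output
response_time <- c(250, 340, 295, 310, 275, 325, 290, 360)

# Quartiles with R's default quantile method (type 7)
q1 <- quantile(response_time, 0.25, names = FALSE)
q3 <- quantile(response_time, 0.75, names = FALSE)
iqr_val <- q3 - q1

# Standard 1.5 * IQR fences
lower_bound <- q1 - 1.5 * iqr_val
upper_bound <- q3 + 1.5 * iqr_val

# Values falling outside the fences would be treated as outliers
flagged <- response_time[response_time < lower_bound |
                         response_time > upper_bound]

cat("Bounds:", lower_bound, "to", upper_bound, "\n")
cat("Values flagged as outliers:", length(flagged), "\n")
```

For these values the fences are 222.5 and 392.5, and every response time (250 to 360) falls inside them, which is why the outlier step removed nothing.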
psych
Generate descriptive statistics for a dataset.
Use the describe() function from the psych package.
# Install the psych package (if not already installed)
if(!require(psych)){install.packages("psych", dependencies=TRUE)}
##                vars  n mean   sd median trimmed  mad min max range  skew
## participant_id    1 10  5.5 3.03    5.5    5.50 3.71   1  10     9  0.00
## hours             2 10  5.6 1.51    5.5    5.62 1.48   3   8     5 -0.08
##                kurtosis   se
## participant_id    -1.56 0.96
## hours             -1.18 0.48
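To see where several of these columns come from, the participant_id row can be reproduced with base R. This is a sketch under the assumption that participant_id holds the integers 1 to 10 (consistent with n = 10, min = 1, max = 10 in the output) and that describe() was run with its default 10% trim; skew and kurtosis are left out.

```r
# Assumed participant_id values: the integers 1 through 10
x <- 1:10

n        <- length(x)
mean_x   <- mean(x)
sd_x     <- sd(x)
median_x <- median(x)
trimmed  <- mean(x, trim = 0.1)  # mean after dropping 10% from each tail
mad_x    <- mad(x)               # median absolute deviation, scaled by 1.4826
se_x     <- sd_x / sqrt(n)       # standard error of the mean

print(round(c(n = n, mean = mean_x, sd = sd_x, median = median_x,
              trimmed = trimmed, mad = mad_x, se = se_x), 2))
```

The rounded values match the participant_id row: sd 3.03, trimmed mean 5.50, mad 3.71, and se 0.96.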
psych
Create graphical summaries of a dataset using the psych
package.
experiment_data <- data.frame(
  response_time = c(250, 340, 295, 310, 275, 325, 290, 360, 285, 310),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, 90),
  age = c(23, 35, 29, 22, 30, 31, 27, 40, 24, 32)
)
Create a correlation plot using the corPlot() function.
Interpretation: Describe the correlation coefficients displayed in the plot, indicating the strength and direction of relationships between variables.
Create pair panels using the pairs.panels()
function.
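Interpretation: Response time and accuracy are negatively correlated, so slower responses tend to go with lower accuracy. Response time and age are strongly positively correlated, so older participants tend to respond more slowly. Accuracy and age show only a small negative correlation. With only 10 observations, the plots by themselves do not establish statistical significance. In the pair panels, the distributions of response time, accuracy, and age appear fairly evenly spread, and the panels show the same correlations as the corPlot.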
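The coefficients behind both plots can be inspected numerically with base R's cor(). The sketch below rebuilds experiment_data, prints the Pearson correlation matrix, and then draws the two psych plots only if the package is available; passing the precomputed matrix to corPlot() and the name r_mat are choices made for this illustration, not necessarily how the original plots were produced.

```r
experiment_data <- data.frame(
  response_time = c(250, 340, 295, 310, 275, 325, 290, 360, 285, 310),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, 90),
  age = c(23, 35, 29, 22, 30, 31, 27, 40, 24, 32)
)

# Pearson correlation matrix of the three variables
r_mat <- cor(experiment_data)
print(round(r_mat, 2))

# Draw the plots only when psych is installed
if (requireNamespace("psych", quietly = TRUE)) {
  psych::corPlot(r_mat, numbers = TRUE)  # heat map with coefficients shown
  psych::pairs.panels(experiment_data)   # scatterplots, histograms, correlations
}
```

The matrix gives roughly -0.59 for response time and accuracy, 0.78 for response time and age, and -0.28 for accuracy and age, which is the basis for describing the relationships as moderate negative, strong positive, and small negative.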