Replace “Your Name” with your actual name.

Objective:

This lab assignment aims to reinforce your understanding of data cleaning and descriptive analysis using the dplyr and psych packages in R. You will apply these concepts through practical exercises, focusing on using and stacking dplyr functions with the %>% operator.

Instructions:

  1. Complete each exercise by writing the necessary R code.

  2. Ensure you use the %>% operator to chain multiple dplyr functions together.

  3. Interpret the results for each exercise.

  4. Knit your R Markdown file to a PDF and submit it as per the submission instructions.
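
As a quick illustration of instruction 2, the sketch below compares nested function calls with the same steps chained through %>%. The scores data frame and its column names are hypothetical and are not part of the assignment; the example only assumes that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, not part of the assignment
scores <- data.frame(
  id    = 1:5,
  score = c(72, 88, NA, 95, 61)
)

# Nested calls have to be read from the inside out
nested <- arrange(filter(scores, !is.na(score)), desc(score))

# The same steps chained with %>% read from top to bottom
piped <- scores %>%
  filter(!is.na(score)) %>%
  arrange(desc(score))

# Both versions produce the same data frame
identical(nested, piped)
```

Each %>% passes the result on its left as the first argument of the function on its right, which is why the piped version lists the steps in the order they are carried out.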

Homework Exercises:

Exercise 1: Cleaning Data with dplyr

Clean a dataset using various dplyr functions.

  1. Use the following dataset for the exercise:
data <- data.frame(
  participant_id = 1:10,
  reaction_time = c(250, 340, 295, NA, 310, 275, 325, 290, 360, NA),
  gender = c("M", "F", "F", "M", "M", "F", "M", "F", "M", "F"),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, NA)
)

print(data)
##    participant_id reaction_time gender accuracy
## 1               1           250      M       95
## 2               2           340      F       87
## 3               3           295      F       92
## 4               4            NA      M       88
## 5               5           310      M       94
## 6               6           275      F       91
## 7               7           325      M       85
## 8               8           290      F       89
## 9               9           360      M       93
## 10             10            NA      F       NA
  2. Clean the dataset by performing the following steps:

    • Remove rows with missing values.

    • Rename the reaction_time column to response_time.

    • Create a new column performance_group based on accuracy (High if accuracy >= 90, otherwise Low).

    • Remove outliers from the response_time column.

    • Relevel the performance_group column to set “Low” as the reference level.
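
The releveling step may be the least familiar of the five, so here is a minimal base R sketch of what it does. The group vector is a hypothetical example, not the assignment data.

```r
# Hypothetical vector of group labels, for illustration only
group <- factor(c("High", "Low", "High", "Low"))

# factor() orders levels alphabetically by default, so "High" comes first
levels(group)            # "High" "Low"

# relevel() moves the chosen level to the front, making it the reference
group_releveled <- relevel(group, ref = "Low")
levels(group_releveled)  # "Low" "High"
```

Only the order of the levels changes; the label attached to each observation stays the same. The reference level matters later, for example when the factor is used as a predictor in a model.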

# Install the dplyr package (if not already installed)
if(!require(dplyr)){install.packages("dplyr", dependencies=TRUE)}
#Load Library
library(dplyr)
remove_outliers <- function(data, column) {
  # Pull the column once (tidy evaluation), then compute quartiles and IQR
  values <- pull(data, {{ column }})
  Q1 <- quantile(values, 0.25, na.rm = TRUE)
  Q3 <- quantile(values, 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val

  # Keep only the rows that fall within the 1.5 * IQR bounds
  data %>%
    filter({{ column }} >= lower_bound,
           {{ column }} <= upper_bound)
}
# Create cleaned_data
cleaned_data <- data %>%
  na.omit() %>%
  rename(response_time = reaction_time) %>%
  mutate(performance_group = ifelse(accuracy >= 90, "High", "Low")) %>%
  remove_outliers(response_time) %>%
  mutate(performance_group = relevel(factor(performance_group), ref = "Low"))

#Now, with the cleaned_data object write code to:
# Remove rows with missing values
# Rename the reaction_time column
# Create a new column performance_group
# Remove outliers from the response_time column
# Relevel the performance_group column


# View the cleaned data
print(cleaned_data)
##   participant_id response_time gender accuracy performance_group
## 1              1           250      M       95              High
## 2              2           340      F       87               Low
## 3              3           295      F       92              High
## 4              5           310      M       94              High
## 5              6           275      F       91              High
## 6              7           325      M       85               Low
## 7              8           290      F       89               Low
## 8              9           360      M       93              High

Interpretation: Describe the changes made to the dataset, such as the number of rows removed due to missing values, the new column created, and any outliers removed.

For the cleaned dataset, we removed two rows that contained missing values (participants 4 and 10), leaving eight rows. We renamed the “reaction_time” column to “response_time”. We created a new column called “performance_group” and labeled each row “High” or “Low” based on its accuracy, with “Low” set as the reference level. We also checked response_time for outliers using the 1.5 × IQR rule; none were found, so no further rows were removed.
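
The counts in this interpretation can be checked with a short base R sketch. It assumes the same 1.5 × IQR rule that remove_outliers() applies and R's default quantile method; raw_data is simply a copy of the exercise dataset under a different name.

```r
# Copy of the exercise dataset
raw_data <- data.frame(
  participant_id = 1:10,
  reaction_time = c(250, 340, 295, NA, 310, 275, 325, 290, 360, NA),
  gender = c("M", "F", "F", "M", "M", "F", "M", "F", "M", "F"),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, NA)
)

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(raw_data))

# 1.5 * IQR bounds for reaction time, computed on the complete rows only
rt <- raw_data$reaction_time[complete.cases(raw_data)]
q <- unname(quantile(rt, c(0.25, 0.75)))
iqr_val <- q[2] - q[1]
lower_bound <- q[1] - 1.5 * iqr_val
upper_bound <- q[2] + 1.5 * iqr_val

# Number of values falling outside the bounds
n_outliers <- sum(rt < lower_bound | rt > upper_bound)

c(incomplete = n_incomplete, lower = lower_bound,
  upper = upper_bound, outliers = n_outliers)
```

With these data the bounds work out to 222.5 and 392.5. Every remaining response time lies between 250 and 360, which is why the outlier step removes nothing.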

Exercise 2: Generating Descriptive Statistics with psych

Generate descriptive statistics for a dataset.

  1. Use the following dataset for the exercise:
study_hours <- data.frame(
  participant_id = 1:10,
  hours = c(5, 6, 4, 7, 5, 3, 8, 6, 5, 7)
)
  2. Generate descriptive statistics using the describe() function from the psych package.
# Install the psych package (if not already installed)
if(!require(psych)){install.packages("psych", dependencies=TRUE)}
#load the psych package
library(psych)
# Generate descriptive statistics
describe(study_hours)
##                vars  n mean   sd median trimmed  mad min max range  skew
## participant_id    1 10  5.5 3.03    5.5    5.50 3.71   1  10     9  0.00
## hours             2 10  5.6 1.51    5.5    5.62 1.48   3   8     5 -0.08
##                kurtosis   se
## participant_id    -1.56 0.96
## hours             -1.18 0.48
  • Interpretation: Explain the descriptive statistics obtained, such as the mean, standard deviation and skewness of the hours variable.

The mean of hours is 5.6 with a standard deviation of 1.51. The median is 5.5, which is very close to the mean; this suggests a roughly symmetric distribution with no extreme values pulling the mean in either direction. The skew of -0.08 is slightly negative, but it is so close to 0 that the distribution can be treated as approximately symmetric.
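
To see where these numbers come from, the sketch below recomputes them in base R. The skew formula (third central moment divided by the cubed sample standard deviation) is an assumption about what describe() does by default; it reproduces the value in the output above, but the psych documentation is the authority on the exact definition.

```r
hours <- c(5, 6, 4, 7, 5, 3, 8, 6, 5, 7)

n  <- length(hours)
m  <- mean(hours)
s  <- sd(hours)        # sample standard deviation (n - 1 denominator)
se <- s / sqrt(n)      # standard error of the mean

# Assumed skew formula: mean cubed deviation divided by sd^3
skew <- sum((hours - m)^3) / (n * s^3)

round(c(mean = m, sd = s, median = median(hours), se = se, skew = skew), 2)
```

Rounded to two decimals, these match the hours row of the describe() output (mean 5.6, sd 1.51, median 5.5, se 0.48, skew -0.08).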

Exercise 3: Visualizing Data with psych

Create graphical summaries of a dataset using the psych package.

  1. Use the following dataset for the exercise:
experiment_data <- data.frame(
  response_time = c(250, 340, 295, 310, 275, 325, 290, 360, 285, 310),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, 90),
  age = c(23, 35, 29, 22, 30, 31, 27, 40, 24, 32)
)
  • Create a correlation plot using the corPlot() function.
# Create the correlation plot
corPlot(cor(experiment_data))

  • Interpretation: Describe the correlation coefficients displayed in the plot, indicating the strength and direction of relationships between variables.

A correlation coefficient of -0.28 indicates a weak negative correlation between age and accuracy: as age increases, accuracy tends to decrease slightly. A coefficient of -0.59 indicates a moderate negative correlation between accuracy and response time: as response time increases, accuracy tends to decrease. A coefficient of 0.78 indicates a strong positive correlation between age and response time: as age increases, response time tends to increase.
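
The coefficients read off the plot can also be printed directly. This is a minimal base R sketch using cor(), which computes Pearson correlations by default; it does not require the psych package.

```r
experiment_data <- data.frame(
  response_time = c(250, 340, 295, 310, 275, 325, 290, 360, 285, 310),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, 90),
  age = c(23, 35, 29, 22, 30, 31, 27, 40, 24, 32)
)

# Pearson correlation matrix, rounded to two decimals
r <- round(cor(experiment_data), 2)
print(r)

# Individual coefficients referred to in the interpretation
r["age", "accuracy"]            # -0.28
r["accuracy", "response_time"]  # -0.59
r["age", "response_time"]       #  0.78
```

With only 10 observations these estimates are imprecise, so the weaker coefficients in particular are best read as rough indications rather than firm conclusions.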

  • Create pair panels using the pairs.panels() function.
# Create the pair panels
experiment_data %>% 
  pairs.panels()

Interpretation: Explain the scatterplots, histograms, and correlation coefficients in the pair panels, highlighting any notable patterns or relationships.

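The scatterplots mirror the correlation coefficients: response time rises clearly with age, accuracy tends to fall as response time increases, and the relationship between age and accuracy is weak, with points scattered more loosely. The histograms on the diagonal show each variable spread fairly evenly across its range, without a pronounced skew.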

Submission Instructions:

Be sure to knit your document to PDF format and check that all content is displayed correctly before submission. Submit this PDF to Canvas Assignments.