Replace “Your Name” with your actual name.
This lab assignment aims to reinforce your understanding of data
cleaning and descriptive analysis using the dplyr and psych packages
in R. You will apply these concepts through practical exercises,
focusing on using and stacking dplyr functions with the %>% operator.
Complete each exercise by writing the necessary R code.
Ensure you use the %>% operator to chain multiple dplyr functions together.
Interpret the results for each exercise.
Knit your R Markdown file to a PDF and submit it as per the submission instructions.
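As a minimal sketch of what chaining with %>% means, the example below applies the same two steps twice: once as nested calls and once as a pipeline. The scores data frame and its columns are hypothetical and are not part of the lab data.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate chaining
scores <- data.frame(
  id = 1:4,
  score = c(80, NA, 95, 70)
)

# Nested calls are read from the inside out
nested <- mutate(filter(scores, !is.na(score)), passed = score >= 75)

# The same steps chained with %>% are read from top to bottom
piped <- scores %>%
  filter(!is.na(score)) %>%
  mutate(passed = score >= 75)

print(piped)
```

Both versions produce the same data frame. The piped form simply lists the steps in the order they run.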
## Exercise 1: Data Cleaning with dplyr
Clean a dataset using various dplyr functions.
data <- data.frame(
  participant_id = 1:10,
  reaction_time = c(250, 340, 295, NA, 310, 275, 325, 290, 360, NA),
  gender = c("M", "F", "F", "M", "M", "F", "M", "F", "M", "F"),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, NA)
)
print(data)
## participant_id reaction_time gender accuracy
## 1 1 250 M 95
## 2 2 340 F 87
## 3 3 295 F 92
## 4 4 NA M 88
## 5 5 310 M 94
## 6 6 275 F 91
## 7 7 325 M 85
## 8 8 290 F 89
## 9 9 360 M 93
## 10 10 NA F NA
Clean the dataset by performing the following steps:
Remove rows with missing values.
Rename the reaction_time column to response_time.
Create a new column performance_group based on accuracy (High if accuracy >= 90, otherwise Low).
Remove outliers from the response_time column.
Relevel the performance_group column to set “Low” as the reference level.
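The releveling step can be easier to follow on a small example. The sketch below uses a hypothetical vector of labels (not the lab data) to show that factor() orders levels alphabetically by default and that relevel() moves the chosen level to the front.

```r
# Hypothetical labels, for illustration only
groups <- factor(c("High", "Low", "High"))

# By default, factor levels are sorted alphabetically
default_levels <- levels(groups)      # "High" "Low"

# relevel() moves the reference level to the first position
releveled <- relevel(groups, ref = "Low")
new_levels <- levels(releveled)       # "Low" "High"

print(default_levels)
print(new_levels)
```

The first level is the one that modeling functions such as lm() treat as the baseline category, which is why the order matters.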
# Install the dplyr package (if not already installed), then load it
if (!require(dplyr)) {
  install.packages("dplyr", dependencies = TRUE)
  library(dplyr)
}
remove_outliers <- function(data, column) {
  # Calculate quartiles and IQR using tidy evaluation
  Q1 <- quantile(pull(data, {{ column }}), 0.25, na.rm = TRUE)
  Q3 <- quantile(pull(data, {{ column }}), 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val
  # Filter rows based on the calculated bounds
  data %>%
    filter({{ column }} >= lower_bound,
           {{ column }} <= upper_bound)
}
# Create cleaned_data
cleaned_data <- data %>%
  na.omit() %>%
  rename(response_time = reaction_time) %>%
  mutate(performance_group = ifelse(accuracy >= 90, "High", "Low")) %>%
  remove_outliers(response_time) %>%
  mutate(performance_group = relevel(factor(performance_group), ref = "Low"))
#Now, with the cleaned_data object write code to:
# Remove rows with missing values
# Rename the reaction_time column
# Create a new column performance_group
# Remove outliers from the response_time column
# Relevel the performance_group column
# View the cleaned data
print(cleaned_data)
## participant_id response_time gender accuracy performance_group
## 1 1 250 M 95 High
## 2 2 340 F 87 Low
## 3 3 295 F 92 High
## 4 5 310 M 94 High
## 5 6 275 F 91 High
## 6 7 325 M 85 Low
## 7 8 290 F 89 Low
## 8 9 360 M 93 High
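The 1.5 × IQR rule used by remove_outliers() can be checked by hand on the eight response times that remain after the missing values are dropped. This is a sketch in base R that assumes quantile()'s default interpolation method, which is the one the function above relies on.

```r
# Response times left after removing rows with missing values
response_time <- c(250, 340, 295, 310, 275, 325, 290, 360)

# Quartiles with R's default quantile method
q1 <- unname(quantile(response_time, 0.25))   # 286.25
q3 <- unname(quantile(response_time, 0.75))   # 328.75
iqr_val <- q3 - q1                            # 42.5

# Bounds outside of which a value counts as an outlier
lower_bound <- q1 - 1.5 * iqr_val             # 222.5
upper_bound <- q3 + 1.5 * iqr_val             # 392.5

# Count of values outside the bounds
n_outliers <- sum(response_time < lower_bound | response_time > upper_bound)
print(n_outliers)
```

All eight values lie between 222.5 and 392.5, so the filter keeps every row.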
Interpretation: Describe the changes made to the dataset, such as the number of rows removed due to missing values, the new column created, and any outliers removed.
In the cleaned dataset, we removed 2 rows with missing data (participants 4 and 10). No response times fell outside the 1.5 × IQR bounds, so no outliers were removed. We renamed the reaction_time column to response_time. We also added a new column, performance_group, that categorizes participant accuracy as Low or High, with Low as the reference level.
## Exercise 2: Generating Descriptive Statistics with psych
Generate descriptive statistics for a dataset.
Use the describe() function from the psych package.
# Install the psych package (if not already installed), then load it
if (!require(psych)) {
  install.packages("psych", dependencies = TRUE)
  library(psych)
}
## vars n mean sd median trimmed mad min max range skew
## participant_id 1 10 5.5 3.03 5.5 5.50 3.71 1 10 9 0.00
## hours 2 10 5.6 1.51 5.5 5.62 1.48 3 8 5 -0.08
## kurtosis se
## participant_id -1.56 0.96
## hours -1.18 0.48
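Several columns of the describe() output can be reproduced with base R, which helps when reading the table. The sketch below assumes participant_id is simply 1:10, an inference from the n, min, max, and mean shown above. The hours values are not reproduced because the original data are not shown here.

```r
# Assumption: participant_id runs from 1 to 10, as the output suggests
participant_id <- 1:10
n <- length(participant_id)

id_mean   <- mean(participant_id)          # 5.5
id_sd     <- sd(participant_id)            # about 3.03
id_median <- median(participant_id)        # 5.5
id_mad    <- mad(participant_id)           # about 3.71
id_se     <- id_sd / sqrt(n)               # about 0.96 (standard error of the mean)

print(round(c(mean = id_mean, sd = id_sd, median = id_median,
              mad = id_mad, se = id_se), 2))
```

The se column is the standard deviation divided by the square root of n, which is also why 1.51 / sqrt(10) gives roughly 0.48 for hours.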
Interpretation: Interpret the descriptive statistics for the hours variable.
The mean number of hours studied is 5.6 with a standard deviation of 1.51. The median is 5.5, which is very close to the mean, suggesting a roughly symmetric distribution with no extreme values pulling the mean. Study hours ranged from 3 to 8. The skew (-0.08) is slightly negative but very close to zero, which is consistent with symmetry. The negative kurtosis (-1.18) suggests a flatter shape than a normal distribution, so normality should not be assumed from the skew alone.
## Exercise 3: Visualizing Data with psych
Create graphical summaries of a dataset using the psych package.
experiment_data <- data.frame(
  response_time = c(250, 340, 295, 310, 275, 325, 290, 360, 285, 310),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, 90),
  age = c(23, 35, 29, 22, 30, 31, 27, 40, 24, 32)
)
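Before looking at the plots, it can help to compute the pairwise Pearson correlations directly, since these are the numbers that corPlot() and pairs.panels() display. The sketch below uses base R's cor() and calls the psych plotting functions only if the package is available. The numbers = TRUE argument is a choice made for this example.

```r
experiment_data <- data.frame(
  response_time = c(250, 340, 295, 310, 275, 325, 290, 360, 285, 310),
  accuracy = c(95, 87, 92, 88, 94, 91, 85, 89, 93, 90),
  age = c(23, 35, 29, 22, 30, 31, 27, 40, 24, 32)
)

# Pairwise Pearson correlations between the three variables
cor_matrix <- cor(experiment_data)
print(round(cor_matrix, 2))

# Draw the plots only when psych is installed
if (requireNamespace("psych", quietly = TRUE)) {
  psych::corPlot(cor_matrix, numbers = TRUE)  # heat map of the correlations
  psych::pairs.panels(experiment_data)        # scatterplots, histograms, and r values
}
```

Comparing the printed matrix with the plots is a quick way to check which relationships are strong and which are weak before writing the interpretation.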
Use the corPlot() function.
Use the pairs.panels() function.
Interpretation: Explain the scatterplots, histograms, and correlation coefficients in the pair panels, highlighting any notable patterns or relationships.
The histograms show that the data are fairly evenly spread out. The scatterplots show a strong positive relationship between response time and age (r of roughly 0.78), a moderate negative relationship between response time and accuracy (roughly -0.59), and a weak negative relationship between accuracy and age (roughly -0.28).
Submission Instructions:
Be sure to knit your document to PDF format, checking that all content is correctly displayed before submission. Submit this PDF to Canvas Assignments.