Take-Home Midterm Exam: Introductory Psychological Statistics

Replace “Your Name” with your actual name.

Instructions

Please complete this exam on your own. Include your R code, interpretations, and answers within this document.

Part 1: Types of Data and Measurement Errors

Question 1: Data Types in Psychological Research

Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:

Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.

Nominal: Categories without order. Example: Gender (male, female, non-binary).
Ordinal: Ordered categories, uneven intervals. Example: Anxiety levels on a Likert scale.
Interval: Equal intervals, no true zero. Example: IQ scores.
Ratio: Equal intervals, true zero. Example: Reaction time in milliseconds.

For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
- Scores on a depression inventory (0-63)
- Response time in milliseconds
- Likert scale ratings of agreement (1-7)
- Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
- Age in years

Scores on a depression inventory (0-63): Interval – Summed scores are usually treated as having equal intervals, but there is no true zero (a score of 0 doesn’t guarantee a complete absence of depression).
Response time in milliseconds: Ratio – Equal intervals and a true zero (0 ms means no elapsed time).
Likert scale ratings of agreement (1-7): Ordinal – Ordered categories, but the intervals may not be exactly equal.
Diagnostic categories (e.g., ADHD or anxiety disorder): Nominal – Categories without a numerical order.
Age in years: Ratio – Equal intervals and a true zero (age 0 is the moment of birth, so a 20-year-old is twice as old as a 10-year-old).
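
As a rough illustration of how these levels of measurement can be represented in R, the sketch below uses made-up values (none of the variable names or numbers come from the exam dataset). It assumes nominal data are stored as a factor, ordinal data as an ordered factor, and interval/ratio data as plain numeric vectors.

```r
# Hypothetical values, for illustration only
diagnosis <- factor(c("ADHD", "Anxiety disorder", "No diagnosis", "ADHD"))  # nominal
agreement <- factor(c(2, 5, 7, 5), levels = 1:7, ordered = TRUE)            # ordinal
bdi_score <- c(4, 17, 30, 12)       # treated here as interval
rt_ms     <- c(412, 388, 530, 295)  # ratio

# Nominal: only counts (and the mode) are meaningful
diagnosis_counts <- table(diagnosis)

# Ordinal: order-based summaries such as the median are meaningful
agreement_median <- median(as.integer(as.character(agreement)))

# Ratio: ratios are meaningful because zero means "none"
rt_ratio <- max(rt_ms) / min(rt_ms)  # 530 ms is about 1.8 times 295 ms
```

The choice of summary follows the level of measurement: counts for nominal data, medians for ordinal data, and means or ratios only once equal intervals (and, for ratios, a true zero) can be assumed.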

Question 2: Measurement Error

Referring to Chapter 3 (Measurement Errors in Psychological Research):

Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.

Random Error: Mistakes that happen by chance and don’t follow a pattern.
- Example: In a memory test, a participant gets distracted by noise and forgets a word.
Systematic Error: Mistakes that happen the same way every time.
- Example: A memory test with blurry text makes it harder for everyone to read, leading to lower scores.
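
The distinction can be sketched with a small simulation. Everything here is hypothetical: the "true" recall scores, the noise level, and the 2-word penalty for blurry text are invented for illustration. The idea is that random error adds spread without shifting the average, while systematic error shifts every score in the same direction.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_participants <- 10000
true_recall <- rnorm(n_participants, mean = 12, sd = 3)  # hypothetical true words recalled

# Random error: zero-mean noise (e.g., momentary distractions)
random_obs <- true_recall + rnorm(n_participants, mean = 0, sd = 2)

# Systematic error: a constant bias (e.g., blurry text costs everyone 2 words)
systematic_obs <- true_recall - 2

mean_shift_random     <- mean(random_obs) - mean(true_recall)      # close to 0
mean_shift_systematic <- mean(systematic_obs) - mean(true_recall)  # -2
sd_increase_random    <- sd(random_obs) - sd(true_recall)          # positive: more spread
```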

How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?

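Effect on Validity: Measurement errors can make the results less accurate. If stress or academic performance is not measured correctly, the study might show a relationship that is weaker than the true one (random error) or one that is distorted in a consistent direction (systematic error).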

Ways to Minimize Errors:
- Use well-tested surveys and tools to measure stress and grades.
- Train researchers to collect data the same way each time.
- Reduce distractions during tests or surveys.
- Double-check and clean data for mistakes.
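
To make the effect on validity concrete, here is a minimal simulation sketch of how random measurement error can weaken an observed correlation. The variables (`stress_true`, `perf_true`), the true correlation of about -0.5, and the amount of noise are all assumptions chosen for illustration, not values from the course materials.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

n_students <- 5000

# Hypothetical standardized "true" scores with a correlation near -0.5
stress_true <- rnorm(n_students)
perf_true   <- -0.5 * stress_true + rnorm(n_students, sd = sqrt(0.75))

# Observed scores = true scores + random measurement error
stress_observed <- stress_true + rnorm(n_students, sd = 1)
perf_observed   <- perf_true + rnorm(n_students, sd = 1)

r_true     <- cor(stress_true, perf_true)
r_observed <- cor(stress_observed, perf_observed)  # noticeably closer to 0
```

Under these assumptions the observed correlation is roughly half the size of the true one, which is why reliable measures matter for detecting real relationships.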

Part 2: Descriptive Statistics and Basic Probability

Question 3: Descriptive Analysis

The code below creates a simulated dataset for a psychological experiment. Run the code chunk below without making any changes:

# Create a simulated dataset
set.seed(123)  # For reproducibility

# Number of participants
n <- 50

# Create the data frame
data <- data.frame(
  participant_id = 1:n,
  reaction_time = rnorm(n, mean = 300, sd = 50),
  accuracy = rnorm(n, mean = 85, sd = 10),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  condition = sample(c("Control", "Experimental"), n, replace = TRUE),
  anxiety_pre = rnorm(n, mean = 25, sd = 8),
  anxiety_post = NA  # We'll fill this in based on condition
)

# Make the experimental condition reduce anxiety more than control
data$anxiety_post <- ifelse(
  data$condition == "Experimental",
  data$anxiety_pre - rnorm(n, mean = 8, sd = 3),  # Larger reduction
  data$anxiety_pre - rnorm(n, mean = 3, sd = 2)   # Smaller reduction
)

# Ensure anxiety doesn't go below 0
data$anxiety_post <- pmax(data$anxiety_post, 0)

# Add some missing values for realism
data$reaction_time[sample(1:n, 3)] <- NA
data$accuracy[sample(1:n, 2)] <- NA

# View the first few rows of the dataset
head(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397

Now, perform the following computations:

Calculate the mean, median, standard deviation, minimum, and maximum for reaction time and accuracy, grouped by condition (hint: use the psych package).

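library(psych)  # provides describeBy()

# Descriptive statistics for reaction time and accuracy, split by condition
describeBy(data[c("reaction_time", "accuracy")], group = data$condition)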

## 
##  Descriptive statistics by group 
## group: Control
##               vars  n   mean    sd median trimmed   mad    min    max  range
## reaction_time    1 30 301.40 48.54 299.68  300.42 55.38 201.67 408.45 206.78
## accuracy         2 29  85.49  9.86  85.53   85.68  8.77  61.91 105.50  43.59
##                skew kurtosis   se
## reaction_time  0.14    -0.66 8.86
## accuracy      -0.15    -0.35 1.83
## ------------------------------------------------------------ 
## group: Experimental
##               vars  n   mean    sd median trimmed   mad    min    max  range
## reaction_time    1 17 295.75 38.37 288.49  295.61 43.74 215.67 377.94 162.27
## accuracy         2 19  88.06  8.20  88.32   87.76  9.86  74.28 106.87  32.59
##               skew kurtosis   se
## reaction_time 0.00    -0.27 9.31
## accuracy      0.45    -0.45 1.88

Using dplyr and piping, create a new variable anxiety_change that represents the difference between pre and post anxiety scores (pre minus post). Then calculate the mean anxiety change for each condition.

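library(dplyr)  # provides %>%, mutate(), group_by(), summarise()

# Note: assigning the result back to `data` replaces the participant-level
# data frame with the two-row summary printed below
data <- data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post) %>%
  group_by(condition) %>%
  summarise(mean_anxiety_change = mean(anxiety_change, na.rm = TRUE))

print(data)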

## # A tibble: 2 × 2
##   condition    mean_anxiety_change
##   <chr>                      <dbl>
## 1 Control                     3.79
## 2 Experimental                8.64

Control group: 3.79. Experimental group: 8.64. The Experimental condition showed the larger average reduction in anxiety.
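
In the code above, the summary is assigned back to `data`, so the participant-level columns are no longer available afterwards. One way to avoid this, sketched below on a small made-up data frame (`toy` and `change_summary` are hypothetical names), is to add the new column in one step and store the summary under a separate name.

```r
library(dplyr)

# Small made-up data frame standing in for `data`
toy <- data.frame(
  condition    = c("Control", "Control", "Experimental", "Experimental"),
  anxiety_pre  = c(20, 30, 24, 28),
  anxiety_post = c(18, 26, 15, 21)
)

# Step 1: add the new column and keep every row
toy <- toy %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post)

# Step 2: store the per-condition summary under a different name
change_summary <- toy %>%
  group_by(condition) %>%
  summarise(mean_anxiety_change = mean(anxiety_change, na.rm = TRUE))
```

With this split, `toy` still has one row per participant (now including `anxiety_change`), and `change_summary` holds the group means.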

Question 4: Probability Calculations

Using the concepts from Chapter 4 (Descriptive Statistics and Basic Probability in Psychological Research):

If reaction times in a cognitive task are normally distributed with a mean of 350ms and a standard deviation of 75ms:
1. What is the probability that a randomly selected participant will have a reaction time greater than 450ms?
2. What is the probability that a participant will have a reaction time between 300ms and 400ms?

mean_rt <- 350  # Mean reaction time in ms
std_rt <- 75    # Standard deviation in ms

# a. Probability of reaction time > 450ms
p_gt_450 <- 1 - pnorm(450, mean = mean_rt, sd = std_rt)

# b. Probability of reaction time between 300ms and 400ms
p_300_to_400 <- pnorm(400, mean = mean_rt, sd = std_rt) - pnorm(300, mean = mean_rt, sd = std_rt)

# Display results
p_gt_450

## [1] 0.09121122

p_300_to_400

## [1] 0.4950149

(a) Probability of a reaction time greater than 450ms: 0.0912 (or 9.12%)
(b) Probability of a reaction time between 300ms and 400ms: 0.4950 (or 49.50%)
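
The same probabilities can be cross-checked by converting each cutoff to a z-score, z = (x - mean) / sd, and using the standard normal distribution. This is only a sketch of the hand calculation; the variable names ending in `_z` are introduced here for illustration.

```r
mean_rt <- 350  # Mean reaction time in ms
sd_rt   <- 75   # Standard deviation in ms

# Standardize each cutoff
z_450 <- (450 - mean_rt) / sd_rt  # about  1.33
z_300 <- (300 - mean_rt) / sd_rt  # about -0.67
z_400 <- (400 - mean_rt) / sd_rt  # about  0.67

# Upper-tail area beyond z_450, and the area between z_300 and z_400
p_gt_450_z     <- pnorm(z_450, lower.tail = FALSE)
p_300_to_400_z <- pnorm(z_400) - pnorm(z_300)
```

Both values should match the results obtained from `pnorm()` with the `mean` and `sd` arguments.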

Part 3: Data Cleaning and Manipulation

Question 5: Data Cleaning with dplyr

Using the dataset created in Part 2, perform the following data cleaning and manipulation tasks:

Remove all rows with missing values and create a new dataset called clean_data.

clean_data <- data %>%
  na.omit()

head(clean_data)

## # A tibble: 2 × 2
##   condition    mean_anxiety_change
##   <chr>                      <dbl>
## 1 Control                     3.79
## 2 Experimental                8.64

Create a new variable performance_category that categorizes participants based on their accuracy:
- “High” if accuracy is greater than or equal to 90
- “Medium” if accuracy is between 70 and 90
- “Low” if accuracy is less than 70

data <- data.frame(
  participant_id = 1:n,
  reaction_time = rnorm(n, mean = 300, sd = 50),
  accuracy = rnorm(n, mean = 85, sd = 10),  # Ensure accuracy is included
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  condition = sample(c("Control", "Experimental"), n, replace = TRUE),
  anxiety_pre = rnorm(n, mean = 25, sd = 8),
  anxiety_post = NA  # Placeholder
)

data$anxiety_post <- ifelse(
  data$condition == "Experimental",
  data$anxiety_pre - rnorm(n, mean = 8, sd = 3),  # Larger reduction
  data$anxiety_pre - rnorm(n, mean = 3, sd = 2)   # Smaller reduction
)

data$anxiety_post <- pmax(data$anxiety_post, 0)

data$reaction_time[sample(1:n, 3)] <- NA
data$accuracy[sample(1:n, 2)] <- NA

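# The earlier summarise() step overwrote `data` with the two-row summary, so the
# dataset is recreated above (without resetting the seed, so values differ from Part 2)
# Recreate clean_data by keeping only rows with no missing values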
clean_data <- data %>%
  filter(complete.cases(.))

# Verifying columns again
colnames(clean_data)

## [1] "participant_id" "reaction_time"  "accuracy"       "gender"        
## [5] "condition"      "anxiety_pre"    "anxiety_post"

clean_data <- clean_data %>%
  mutate(performance_category = case_when(
    accuracy >= 90 ~ "High",
    accuracy >= 70 & accuracy < 90 ~ "Medium",
    accuracy < 70 ~ "Low",
    TRUE ~ NA_character_
  ))

# View the updated dataset
head(clean_data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      278.1420  76.06792   Male      Control   32.337399
## 2              2      316.5590  88.33903   Male      Control    3.712618
## 3              3      199.2895  89.11430   Male      Control   33.882217
## 4              4      310.5990  84.66964   Male      Control   21.120099
## 5              6      401.8787 110.71458   Male Experimental   22.638738
## 6              7      365.0588  82.94701 Female      Control   31.975720
##   anxiety_post performance_category
## 1    26.033569               Medium
## 2     3.630559               Medium
## 3    30.984813               Medium
## 4    19.173950               Medium
## 5    14.876373                 High
## 6    30.643407               Medium

Filter the dataset to include only participants in the Experimental condition with reaction times faster than the overall mean reaction time.

mean_reaction_time <- mean(clean_data$reaction_time, na.rm = TRUE)

filtered_data <- clean_data %>%
  filter(condition == "Experimental" & reaction_time < mean_reaction_time)

head(filtered_data)

##   participant_id reaction_time accuracy gender    condition anxiety_pre
## 1             10      269.9247 95.24673   Male Experimental    21.87452
## 2             11      282.3977 93.17659 Female Experimental    16.25770
## 3             14      237.0676 75.54591 Female Experimental    38.79410
## 4             21      273.8544 90.10133 Female Experimental    23.28767
## 5             23      296.9589 75.03219 Female Experimental    38.69844
## 6             34      289.9609 73.64412   Male Experimental    20.40484
##   anxiety_post performance_category
## 1     13.25773                 High
## 2     10.10701                 High
## 3     29.86405               Medium
## 4     11.94372                 High
## 5     26.98841               Medium
## 6     11.32828               Medium

I cleaned the dataset by first removing any rows with missing values to ensure that all data was complete and usable. Then, I created a new variable to categorize participants based on their accuracy scores into High, Medium, and Low performance levels. After that, I calculated the overall mean reaction time and filtered the dataset to include only participants in the Experimental condition who had reaction times faster than this average.
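
As a possible alternative to `case_when()`, the same three categories can be sketched with base R's `cut()`. The example values below are made up to probe the boundaries at 70 and 90; `right = FALSE` makes each interval include its lower bound, so 70 counts as "Medium" and 90 as "High", matching the rules stated in the question.

```r
# Made-up accuracy values chosen to test the category boundaries
accuracy_example <- c(69.9, 70, 89.9, 90, NA)

performance_example <- cut(
  accuracy_example,
  breaks = c(-Inf, 70, 90, Inf),       # [-Inf, 70), [70, 90), [90, Inf)
  labels = c("Low", "Medium", "High"),
  right  = FALSE                       # intervals are closed on the left
)

as.character(performance_example)
```

Note that `cut()` returns a factor with the levels in the order Low, Medium, High, and missing accuracy values stay missing.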

Part 4: Visualization and Correlation Analysis

Question 6: Correlation Analysis with the psych Package

Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:

Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
Use the psych package’s corPlot() function to create a correlation plot.
Interpret the resulting plot by addressing:
- Which variables appear to be strongly correlated?
- Are there any surprising relationships?
- How might these correlations inform further research in psychology?

# Your code here. Hint: first, with dplyr create a new dataset that selects only the numeric variables (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).

clean_data <- clean_data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post)

numeric_data <- clean_data %>%
  select(reaction_time, accuracy, anxiety_pre, anxiety_post, anxiety_change)

# Check if the selection worked
colnames(numeric_data)

## [1] "reaction_time"  "accuracy"       "anxiety_pre"    "anxiety_post"  
## [5] "anxiety_change"

cor_matrix <- cor(numeric_data, use = "pairwise.complete.obs")

corPlot(cor_matrix, 
        main = "Correlation Plot of Psychological Variables", 
        scale = FALSE, 
        diag = FALSE, 
        cex = 0.7)  # Adjust text size

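The plot shows that pre- and post-experiment anxiety are strongly positively correlated, and that bigger reductions in anxiety are associated with lower final scores. Both patterns are expected, because anxiety_post was generated from anxiety_pre and anxiety_change is computed from the two. A positive correlation between reaction time and accuracy (slower responses being more accurate) would suggest a speed-accuracy tradeoff, whereas a negative correlation would challenge that idea. In this dataset the two were simulated independently, so any correlation between them should be close to zero. If anxiety and accuracy aren’t related, it could mean anxiety doesn’t always hurt performance. These insights can help researchers explore ways to improve focus and reduce anxiety effects.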
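
To back up a visual reading of the plot with numbers, one option is to pull the strongest pairwise correlation out of the matrix directly. The sketch below uses a small simulated data frame (`toy_numeric`, a hypothetical stand-in, not the exam dataset) in which `anxiety_post` is built from `anxiety_pre`, so that pair should come out as the strongest.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data: accuracy is independent, anxiety_post depends on anxiety_pre
toy_numeric <- data.frame(
  anxiety_pre = rnorm(200, mean = 25, sd = 8),
  accuracy    = rnorm(200, mean = 85, sd = 10)
)
toy_numeric$anxiety_post <- toy_numeric$anxiety_pre - rnorm(200, mean = 5, sd = 3)

toy_cor <- cor(toy_numeric, use = "pairwise.complete.obs")

# Blank out the diagonal and lower triangle so each pair appears once
toy_cor[lower.tri(toy_cor, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation
strongest <- which(abs(toy_cor) == max(abs(toy_cor), na.rm = TRUE), arr.ind = TRUE)
strongest_pair <- c(rownames(toy_cor)[strongest[1, "row"]],
                    colnames(toy_cor)[strongest[1, "col"]])
strongest_pair
```

The same approach could be applied to `cor_matrix` to report exact values alongside the plot.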

Part 5: Reflection and Application

Question 7: Reflection

Reflect on how the statistical concepts and R techniques covered in this course apply to psychological research:

Describe a specific research question in psychology that interests you. What type of data would you collect, what statistical analyses would be appropriate, and what potential measurement errors might you need to address?
How has learning R for data analysis changed your understanding of psychological statistics? What do you see as the biggest advantages and challenges of using R compared to other statistical software?

1. How does social media usage impact anxiety levels in college students? I would collect self-reported social media usage (hours per day) and anxiety scores (using a standardized scale like GAD-7).
2. Learning R has helped me understand data manipulation, visualization, and statistical testing, making it a lot easier for me to work with datasets. The biggest advantage of R is its flexibility and powerful libraries, which allow for advanced analyses and automation.

Submission Instructions:

Be sure to knit your document to HTML format, checking that all content is correctly displayed before submission. Publish your assignment to RPubs and submit the URL to Canvas.