Take-Home Midterm Exam: Introductory Psychological Statistics

Replace “Your Name” with your actual name.

Instructions

Please complete this exam on your own. Include your R code, interpretations, and answers within this document.

Part 1: Types of Data and Measurement Errors

Question 1: Data Types in Psychological Research

Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:

Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.

Nominal data consist of categories that have no inherent order. For example, types of therapy such as Cognitive Behavioral Therapy (CBT), Psychodynamic Therapy, and Humanistic Therapy are nominal: they are labels without any ranking. Ordinal data consist of categories that do have an order, but the spacing between them is not necessarily equal. A common example is a Likert item on which participants rate how much they agree with a statement like “I feel anxious often,” from “Strongly Disagree” to “Strongly Agree.” The order matters, but the distance between adjacent levels is not known to be equal. Interval data are numeric, with equal spacing between values but no true zero point. For instance, IQ scores are usually treated as interval data: differences between scores are meaningful, but a score of zero would not mean a total lack of intelligence. Ratio data have equal spacing and a true zero, so values can be compared as “twice as much” or “half as much.” An example is reaction time in milliseconds during a psychological test. Zero milliseconds means no elapsed time, so ratios as well as differences are meaningful. These distinctions help researchers choose appropriate analyses for their data.

For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
- Scores on a depression inventory (0-63)
- Response time in milliseconds
- Likert scale ratings of agreement (1-7)
- Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
- Age in years

Scores on a depression inventory (0–63) are typically treated as interval data. The scores are analyzed as if the intervals between values were equal, meaning the difference between a score of 20 and 30 is taken to be the same as between 30 and 40. However, a score of zero is not a true zero that represents a complete absence of depression.
Response time in milliseconds is ratio data. Time is measured on a numerical scale with equal intervals and a true zero: zero milliseconds means no time has passed. Because of this, it is possible to say that one response was twice as fast as another.
Likert scale ratings of agreement (1–7) are ordinal. The numbers represent a rank order, but the distance between adjacent ratings may not be equal.
Diagnostic categories are nominal. They are labels without a ranking or order, meaning one diagnosis is not ranked above another.
Age in years is ratio data. Age is numerical with equal intervals and a true zero (zero years means no age). This allows comparisons such as saying one person is twice as old as another.
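As a rough illustration of how these levels of measurement can be represented in R, the sketch below uses made-up values (none of them come from the exam dataset). The variable names are hypothetical; the point is only that each level supports a different set of operations.

```r
# Hypothetical example values, not taken from the exam dataset.
diagnosis <- factor(c("ADHD", "Anxiety disorder", "No diagnosis", "ADHD"))  # nominal
agreement <- factor(c(2, 7, 4, 4), levels = 1:7, ordered = TRUE)             # ordinal
response_ms <- c(250, 500, 310, 420)                                         # ratio

# Nominal: only counting category membership is meaningful.
diagnosis_counts <- table(diagnosis)

# Ordinal: rank-based summaries such as the median are meaningful.
agreement_median <- median(as.integer(agreement))
above_midpoint <- as.integer(agreement) > 4

# Ratio: a true zero makes statements like "twice as long" meaningful.
time_ratio <- response_ms[2] / response_ms[1]

print(diagnosis_counts)
print(agreement_median)
print(time_ratio)
```

Storing the Likert ratings as an ordered factor is one possible choice; many analyses instead keep them as integers and rely on the analyst to remember that they are ordinal.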

Question 2: Measurement Error

Referring to Chapter 3 (Measurement Errors in Psychological Research):

Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.

Random error is unpredictable, chance variation that pushes measurements up or down inconsistently. For example, in a memory experiment, some participants might be distracted by background noise while recalling words, causing occasional mistakes. These errors are random and tend to average out across many observations.

Systematic error is a consistent, predictable bias that affects measurements in the same direction every time. In a memory experiment, this could happen if the experimenter gives one group more encouragement or if the word lists given to different groups differ in difficulty. This type of error skews the results and makes them less accurate. Overall, random error makes results less reliable, while systematic error makes them less valid.
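The contrast can be sketched with a small simulation. Everything here is an assumption made for illustration: the true recall score of 12 words, the noise level, and the constant bias of 3 words are invented values, not results from any study.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

true_recall <- 12   # hypothetical true number of words recalled
n_trials <- 10000   # many repeated measurements

# Random error: noise centered on zero.
random_noise <- rnorm(n_trials, mean = 0, sd = 2)
observed_random <- true_recall + random_noise

# Systematic error: the same noise plus a constant bias,
# for example an easier word list that adds about 3 words.
observed_systematic <- true_recall + random_noise + 3

mean_random <- mean(observed_random)          # stays close to the true value
mean_systematic <- mean(observed_systematic)  # shifted by the bias

print(c(random = mean_random, systematic = mean_systematic))
```

Averaging over more trials shrinks the effect of the random noise, but it does nothing to the constant bias.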

How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?

Measurement error can threaten the validity of a study by making the results less accurate and potentially misleading. If the tools used to measure stress or academic performance are not reliable, the study might not show how stress truly relates to performance. For example, if participants' stress levels are measured poorly, the observed relationship can be weakened, so it might seem that stress does not affect academic performance even though it does. To reduce these errors, researchers can use well-validated and reliable measurement tools, test all participants with the same standardized procedure, and increase the sample size so that random errors average out. They can also pilot test their tools before the study to make sure they work as intended. These steps help make the study's results more accurate and valid.
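A minimal simulation of this weakening effect is sketched below. The true correlation of about -0.5 and the amount of noise added to the stress measure are assumptions chosen for illustration only.

```r
set.seed(2024)  # arbitrary seed for reproducibility
n <- 5000

# Hypothetical "true" stress and a performance score related to it (r about -0.5).
stress_true <- rnorm(n)
performance <- -0.5 * stress_true + rnorm(n, sd = sqrt(1 - 0.25))

# An unreliable stress measure: true stress plus random measurement error.
stress_noisy <- stress_true + rnorm(n, sd = 1)

r_true <- cor(stress_true, performance)
r_noisy <- cor(stress_noisy, performance)

print(round(c(true = r_true, noisy = r_noisy), 2))
```

The correlation computed from the noisy measure is closer to zero than the one computed from the true scores, which is how random measurement error can hide a real relationship.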

Part 2: Descriptive Statistics and Basic Probability

Question 3: Descriptive Analysis

The code below creates a simulated dataset for a psychological experiment. Run the code chunk below without making any changes:

# Create a simulated dataset
set.seed(123)  # For reproducibility

# Number of participants
n <- 50

# Create the data frame
data <- data.frame(
  participant_id = 1:n,
  reaction_time = rnorm(n, mean = 300, sd = 50),
  accuracy = rnorm(n, mean = 85, sd = 10),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  condition = sample(c("Control", "Experimental"), n, replace = TRUE),
  anxiety_pre = rnorm(n, mean = 25, sd = 8),
  anxiety_post = NA  # We'll fill this in based on condition
)

# Make the experimental condition reduce anxiety more than control
data$anxiety_post <- ifelse(
  data$condition == "Experimental",
  data$anxiety_pre - rnorm(n, mean = 8, sd = 3),  # Larger reduction
  data$anxiety_pre - rnorm(n, mean = 3, sd = 2)   # Smaller reduction
)

# Ensure anxiety doesn't go below 0
data$anxiety_post <- pmax(data$anxiety_post, 0)

# Add some missing values for realism
data$reaction_time[sample(1:n, 3)] <- NA
data$accuracy[sample(1:n, 2)] <- NA

# View the first few rows of the dataset
head(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397

Now, perform the following computations:

Calculate the mean, median, standard deviation, minimum, and maximum for reaction time and accuracy, grouped by condition (hint: use the psych package).

# Load the psych package, which provides describeBy()
library(psych)
# Calculate descriptive statistics of reaction_time grouped by condition
describeBy(data$reaction_time, data$condition)

## 
##  Descriptive statistics by group 
## group: Control
##    vars  n  mean    sd median trimmed   mad    min    max  range skew kurtosis
## X1    1 30 301.4 48.54 299.68  300.42 55.38 201.67 408.45 206.78 0.14    -0.66
##      se
## X1 8.86
## ------------------------------------------------------------ 
## group: Experimental
##    vars  n   mean    sd median trimmed   mad    min    max  range skew kurtosis
## X1    1 17 295.75 38.37 288.49  295.61 43.74 215.67 377.94 162.27    0    -0.27
##      se
## X1 9.31
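The question asks for both reaction time and accuracy. One possible way to get the requested statistics for both variables in a single table is sketched below with dplyr. The data frame `demo` is a hypothetical stand-in that only borrows the column names of the exam dataset, so its numbers are not the exam results.

```r
library(dplyr)

# Hypothetical stand-in data with the same column names as the exam dataset.
set.seed(1)
demo <- data.frame(
  condition = rep(c("Control", "Experimental"), each = 10),
  reaction_time = rnorm(20, mean = 300, sd = 50),
  accuracy = rnorm(20, mean = 85, sd = 10)
)
demo$reaction_time[3] <- NA  # one missing value to show na.rm handling

# Mean, median, SD, minimum, and maximum of both variables per condition.
condition_summary <- demo %>%
  group_by(condition) %>%
  summarise(
    across(
      c(reaction_time, accuracy),
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE),
        sd = ~ sd(.x, na.rm = TRUE),
        min = ~ min(.x, na.rm = TRUE),
        max = ~ max(.x, na.rm = TRUE)
      )
    )
  )

print(condition_summary)
```

With the psych package, passing both columns at once, for example `describeBy(data[, c("reaction_time", "accuracy")], group = data$condition)`, should give the equivalent table in the format shown above.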

Using dplyr and piping, create a new variable anxiety_change that represents the difference between pre and post anxiety scores (pre minus post). Then calculate the mean anxiety change for each condition.

library(dplyr)

# Create a new variable named anxiety_change (pre minus post)
data <- data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post)
# Calculate the mean anxiety change for each condition.
# Store the summary under its own name so the full dataset in `data` is not overwritten.
anxiety_summary <- data %>%
  group_by(condition) %>%
  summarize(mean_anxiety_change = mean(anxiety_change, na.rm = TRUE))
anxiety_summary

The mean for anxiety change in the control group is 3.79, and the mean for the experimental group is 8.64.

Question 4: Probability Calculations

Using the concepts from Chapter 4 (Descriptive Statistics and Basic Probability in Psychological Research):

If reaction times in a cognitive task are normally distributed with a mean of 350ms and a standard deviation of 75ms:
1. What is the probability that a randomly selected participant will have a reaction time greater than 450ms?
2. What is the probability that a participant will have a reaction time between 300ms and 400ms?

# (a)
1 - pnorm(450, mean = 350, sd = 75)

## [1] 0.09121122

# (b)
pnorm(400, mean = 350, sd = 75) - pnorm(300, mean = 350, sd = 75)

## [1] 0.4950149

The probability that a randomly selected participant will have a reaction time greater than 450ms is about .09, or 9%.
The probability that a randomly selected participant will have a reaction time between 300ms and 400ms is about .495, or roughly 50%.
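The same probabilities can be reached by first converting each cutoff to a z-score, z = (x - mean) / sd, and then using the standard normal distribution. The sketch below only restates the calculation above in that form; the variable names are chosen for illustration.

```r
mu <- 350     # mean reaction time in ms (from the question)
sigma <- 75   # standard deviation in ms (from the question)

# Standardize each cutoff.
z_450 <- (450 - mu) / sigma
z_300 <- (300 - mu) / sigma
z_400 <- (400 - mu) / sigma

# (a) Upper-tail probability for reaction times above 450ms.
p_above_450 <- pnorm(z_450, lower.tail = FALSE)

# (b) Area between the two cutoffs.
p_between <- pnorm(z_400) - pnorm(z_300)

print(c(p_above_450 = p_above_450, p_between = p_between))
```

Using `lower.tail = FALSE` is equivalent to computing `1 - pnorm(...)`.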

Part 3: Data Cleaning and Manipulation

Question 5: Data Cleaning with dplyr

Using the dataset created in Part 2, perform the following data cleaning and manipulation tasks:

Remove all rows with missing values and create a new dataset called clean_data.

# Create data frame (the first six rows of the Part 2 dataset, as printed by head(data))
data <- data.frame(
  participant_id = c(1, 2, 3, 4, 5, 6),
  reaction_time = c(271.9762, 288.4911, 377.9354, 303.5254, 306.4644, 385.7532),
  accuracy = c(87.53319, 84.71453, 84.57130, 98.68602, 82.74229, 100.16471),
  gender = c("Female", "Female", "Female", "Male", "Female", "Female"),
  condition = c("Control", "Experimental", "Experimental", "Control", "Control", "Control"),
  anxiety_pre = c(31.30191, 31.15234, 27.65762, 16.93299, 24.04438, 22.75684),
  anxiety_post = c(29.05312, 19.21510, 20.45306, 13.75199, 17.84736, 19.93397)
)

# Remove rows with missing values using na.omit()
clean_data <- na.omit(data)

# View the cleaned dataset
print(clean_data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397
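The six rows above happen to contain no missing values, so `na.omit()` leaves them unchanged. A small sketch with made-up values shows what the step does when NAs are present; `demo` and its contents are hypothetical.

```r
# Hypothetical data with a few missing values.
demo <- data.frame(
  participant_id = 1:6,
  reaction_time = c(272, NA, 378, 304, 306, NA),
  accuracy = c(88, 85, NA, 99, 83, 100)
)

# Count missing values per column before removing anything.
missing_per_column <- colSums(is.na(demo))

# Drop every row that has at least one missing value.
clean_demo <- na.omit(demo)
rows_dropped <- nrow(demo) - nrow(clean_demo)

print(missing_per_column)
print(clean_demo)
print(rows_dropped)
```

Checking the counts first makes it easier to confirm that the number of dropped rows is what was expected.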

Create a new variable performance_category that categorizes participants based on their accuracy:
- “High” if accuracy is greater than or equal to 90
- “Medium” if accuracy is between 70 and 90
- “Low” if accuracy is less than 70

library(dplyr)

# Example data frame
data <- data.frame(
  participant_id = c(1, 2, 3, 4, 5, 6),
  reaction_time = c(271.9762, 288.4911, 377.9354, 303.5254, 306.4644, 385.7532),
  accuracy = c(87.53319, 84.71453, 84.57130, 98.68602, 82.74229, 100.16471),
  gender = c("Female", "Female", "Female", "Male", "Female", "Female"),
  condition = c("Control", "Experimental", "Experimental", "Control", "Control", "Control"),
  anxiety_pre = c(31.30191, 31.15234, 27.65762, 16.93299, 24.04438, 22.75684),
  anxiety_post = c(29.05312, 19.21510, 20.45306, 13.75199, 17.84736, 19.93397)
)

# Create performance_category based on accuracy
data <- data %>%
  mutate(performance_category = case_when(
    accuracy >= 90 ~ "High",       # Accuracy 90 and above
    accuracy >= 70 & accuracy < 90 ~ "Medium",  # Accuracy between 70 and 90
    accuracy < 70 ~ "Low"          # Accuracy below 70
  ))

# View the updated data
print(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post performance_category
## 1     29.05312               Medium
## 2     19.21510               Medium
## 3     20.45306               Medium
## 4     13.75199                 High
## 5     17.84736               Medium
## 6     19.93397                 High
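An alternative way to build the same categories is base R's `cut()`, sketched below on made-up accuracy values that sit on and around the boundaries. This is only one possible approach; the `case_when()` version above gives the same groupings.

```r
# Hypothetical accuracy values, including the boundary cases 70 and 90.
accuracy <- c(65, 70, 89.9, 90, 100.2, NA)

# right = FALSE makes the intervals [-Inf, 70), [70, 90), [90, Inf),
# so exactly 70 is "Medium" and exactly 90 is "High".
performance_category <- cut(
  accuracy,
  breaks = c(-Inf, 70, 90, Inf),
  labels = c("Low", "Medium", "High"),
  right = FALSE
)

print(data.frame(accuracy, performance_category))
```

Missing accuracy values stay missing in the new variable, which matches how `case_when()` behaves when no condition is TRUE.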

Filter the dataset to include only participants in the Experimental condition with reaction times faster than the overall mean reaction time.

library(dplyr)

# Example data frame
data <- data.frame(
  participant_id = c(1, 2, 3, 4, 5, 6),
  reaction_time = c(271.9762, 288.4911, 377.9354, 303.5254, 306.4644, 385.7532),
  accuracy = c(87.53319, 84.71453, 84.57130, 98.68602, 82.74229, 100.16471),
  gender = c("Female", "Female", "Female", "Male", "Female", "Female"),
  condition = c("Control", "Experimental", "Experimental", "Control", "Control", "Control"),
  anxiety_pre = c(31.30191, 31.15234, 27.65762, 16.93299, 24.04438, 22.75684),
  anxiety_post = c(29.05312, 19.21510, 20.45306, 13.75199, 17.84736, 19.93397)
)

# Calculate the overall mean reaction time, ignoring any missing values
mean_reaction_time <- mean(data$reaction_time, na.rm = TRUE)

# Keep only Experimental participants whose reaction times are faster than the overall mean
filtered_data <- data %>%
  filter(condition == "Experimental", reaction_time < mean_reaction_time)

# View the filtered data
print(filtered_data)

##   participant_id reaction_time accuracy gender    condition anxiety_pre
## 1              2      288.4911 84.71453 Female Experimental    31.15234
##   anxiety_post
## 1      19.2151
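When the reaction times contain missing values, as in the full simulated dataset, the mean has to be computed with `na.rm = TRUE`, otherwise the comparison inside `filter()` is made against NA and no rows are kept. The sketch below uses a small hypothetical data frame to show the difference.

```r
library(dplyr)

# Hypothetical data with one missing reaction time.
demo <- data.frame(
  participant_id = 1:5,
  condition = c("Control", "Experimental", "Experimental", "Control", "Experimental"),
  reaction_time = c(270, 290, 380, NA, 300)
)

# Without na.rm the mean is NA.
mean_with_na <- mean(demo$reaction_time)

# With na.rm the mean uses the four observed values.
overall_mean <- mean(demo$reaction_time, na.rm = TRUE)

# Experimental participants faster than the overall mean.
fast_experimental <- demo %>%
  filter(condition == "Experimental", reaction_time < overall_mean)

print(fast_experimental)
```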

To clean the data, I started by removing rows with missing values using na.omit() and saving the result as clean_data. Next, I created a new variable called performance_category to group participants based on their accuracy scores. I categorized them as “High” if their accuracy was 90 or above, “Medium” if it was at least 70 but below 90, and “Low” if it was below 70. I used mutate() along with case_when(). Then, I filtered the dataset to include only participants in the Experimental condition whose reaction times were faster than the overall mean. To do this, I first calculated the mean reaction time and then used filter() to narrow down the data.

Part 4: Visualization and Correlation Analysis

Question 6: Correlation Analysis with the psych Package

Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:

Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
Use the psych package’s corPlot() function to create a correlation plot.
Interpret the resulting plot by addressing:
- Which variables appear to be strongly correlated?
- Are there any surprising relationships?
- How might these correlations inform further research in psychology?

library(dplyr)
library(psych)

# Example data (the first six rows of the Part 2 dataset)
data <- data.frame(
  participant_id = c(1, 2, 3, 4, 5, 6),
  reaction_time = c(271.9762, 288.4911, 377.9354, 303.5254, 306.4644, 385.7532),
  accuracy = c(87.53319, 84.71453, 84.57130, 98.68602, 82.74229, 100.16471),
  gender = c("Female", "Female", "Female", "Male", "Female", "Female"),
  condition = c("Control", "Experimental", "Experimental", "Control", "Control", "Control"),
  anxiety_pre = c(31.30191, 31.15234, 27.65762, 16.93299, 24.04438, 22.75684),
  anxiety_post = c(29.05312, 19.21510, 20.45306, 13.75199, 17.84736, 19.93397)
)
# Create a new variable (pre minus post)
data <- data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post)
# Select only numeric variables
numeric_data <- data %>%
  select(reaction_time, accuracy, anxiety_pre, anxiety_post, anxiety_change)
# Create the correlation plot; pairwise deletion keeps the call usable if values are missing.
# A "figure margins too large" error usually means the plotting device is too small,
# so enlarging the plot pane or the figure size is the usual remedy.
corPlot(cor(numeric_data, use = "pairwise.complete.obs"))

## Error in plot.new(): figure margins too large

In the correlation plot, anxiety_pre and anxiety_post would be expected to show a strong positive correlation, suggesting that anxiety levels before and after the intervention are closely related. A negative correlation between reaction_time and accuracy would mean that participants with faster (lower) reaction times tended to be more accurate, whereas a positive correlation would point to a speed-accuracy trade-off in which faster responses come at the cost of accuracy. A negative correlation between anxiety_change and accuracy would suggest that participants whose anxiety dropped the most performed less accurately, which would be unexpected. These correlations could inform future psychological research. For instance, they could motivate studies of how anxiety affects cognitive performance, whether reducing anxiety improves accuracy, and how the speed-accuracy trade-off influences decision-making. Further research could test interventions that manage anxiety and improve performance, particularly in high-stress situations, or examine how cognitive load and anxiety affect performance in tasks requiring both speed and accuracy.
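To read the actual values rather than rely on the plot, the correlation matrix can be printed directly. The sketch below uses simulated stand-in data (the means, standard deviations, and seed are assumptions), so its numbers are illustrative and are not the exam results. It also shows that anxiety_change is computed from anxiety_pre and anxiety_post, so part of its correlation with them comes from that arithmetic rather than from any psychological process.

```r
set.seed(7)  # arbitrary seed for reproducibility
n <- 200

# Hypothetical stand-in data using the exam's variable names.
anxiety_pre <- rnorm(n, mean = 25, sd = 8)
anxiety_post <- anxiety_pre - rnorm(n, mean = 5, sd = 3)
demo <- data.frame(
  anxiety_pre,
  anxiety_post,
  anxiety_change = anxiety_pre - anxiety_post,
  accuracy = rnorm(n, mean = 85, sd = 10)
)
demo$accuracy[c(5, 20)] <- NA  # a couple of missing values

# Pairwise deletion uses every available pair of observations.
cor_matrix <- cor(demo, use = "pairwise.complete.obs")

print(round(cor_matrix, 2))
```

In this stand-in data accuracy is generated independently of anxiety, so any sizeable correlation involving accuracy would have to come from the real dataset, not from this sketch.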

Part 5: Reflection and Application

Question 7: Reflection

Reflect on how the statistical concepts and R techniques covered in this course apply to psychological research:

Describe a specific research question in psychology that interests you. What type of data would you collect, what statistical analyses would be appropriate, and what potential measurement errors might you need to address?
How has learning R for data analysis changed your understanding of psychological statistics? What do you see as the biggest advantages and challenges of using R compared to other statistical software?

1. A research question that interests me is: How do early childhood experiences, including trauma and neglect, impact the development of criminal tendencies? I would collect both qualitative data (e.g., interviews) and quantitative data (e.g., surveys measuring trauma severity and criminal behavior). Statistical analyses like correlation and regression would help examine the relationship between trauma and criminal behavior, while path analysis could explore direct and indirect effects. Potential sources of measurement error include recall bias and social desirability bias, and sampling bias could limit how representative the data are, so careful attention would be needed to ensure accurate and representative data collection.
2. Learning R for data analysis has improved my understanding of psychological statistics, even though I initially struggled with how new and different it was. The ability to manipulate data, perform advanced analyses, and create visualizations has given me valuable insight into how to work with complex datasets. R's flexibility, reproducibility, and extensive range of packages are major advantages over other software. However, the programming aspect and syntax can be challenging for beginners. Despite these challenges, I have learned a lot about data analysis, and R has strengthened my ability to conduct and interpret psychological research.

Submission Instructions:

Be sure to knit your document to HTML format and check that all content is displayed correctly before submission. Publish your assignment to RPubs and submit the URL to Canvas.