Take-Home Midterm Exam: Introductory Psychological Statistics

Instructions

Please complete this exam on your own. Include your R code, interpretations, and answers within this document.

Part 1: Types of Data and Measurement Errors

Question 1: Data Types in Psychological Research

Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:

Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.

Nominal data is categorical data that sorts observations into groups with no inherent order and no numeric meaning. Personality types are an example of nominal data because they place individuals into distinct groups that have no numerical or meaningful sequence. When individuals are labeled as introverted or extroverted, the labels classify them by characteristic only; neither group is “greater” or “lesser” than the other, as values on a measurable scale such as temperature would be.

Ordinal data is categorical data with a meaningful order or ranking, but the intervals between categories are not guaranteed to be equal. An example is a Likert-type item, which measures attitudes or opinions by having participants choose from ranked options such as “Agree,” “Neutral,” and “Disagree.” The responses have a meaningful order, which makes them ordinal, but the difference between “Agree” and “Neutral” is not necessarily the same size as the difference between “Neutral” and “Disagree.”

Interval data is numerical data with equal intervals between values, so differences between scores can be compared meaningfully. Interval data lacks a true zero point: a value of zero does not denote the complete absence of the measured variable. IQ test scores are an example because the difference between scores is treated as consistent and meaningful. An IQ score of 120 is the same distance from 100 as an IQ score of 80, but a score of 0 doesn’t represent “no intelligence,” so ratios such as “twice as intelligent” are not meaningful.

Ratio data is numerical data with all the properties of interval data plus a true zero point, meaning zero indicates the absence of the measured quantity. An example is the time it takes a participant to respond to a stimulus: a reaction time of 0 seconds represents no elapsed time, so ratios are meaningful and a reaction time of 400 ms is twice as long as one of 200 ms.
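As a rough illustration of how the four levels could be represented in R, the sketch below uses made-up values (none of them come from the exam dataset) and shows which operations are meaningful at each level. The variable names are hypothetical.

```r
# Hypothetical example values; none of these come from the exam dataset.
personality <- factor(c("Introvert", "Extrovert", "Introvert"))   # nominal
agreement   <- factor(c("Agree", "Neutral", "Disagree"),
                      levels = c("Disagree", "Neutral", "Agree"),
                      ordered = TRUE)                               # ordinal
iq          <- c(80, 100, 120)                                      # interval
rt_seconds  <- c(0.25, 0.50, 1.00)                                  # ratio

# Nominal: only counting and equality checks are meaningful
table(personality)

# Ordinal: order comparisons work, but the spacing between levels is unknown
agreement[1] > agreement[2]     # "Agree" ranks above "Neutral"

# Interval: differences are meaningful
diff(iq)                        # two equal 20-point steps

# Ratio: ratios are meaningful because zero means "no time elapsed"
rt_seconds[3] / rt_seconds[2]   # 1.00 s is twice as long as 0.50 s
```

Storing ordinal responses as an ordered factor rather than as numbers keeps R from silently averaging categories whose spacing is not known.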

For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
- Scores on a depression inventory (0-63)
- Response time in milliseconds
- Likert scale ratings of agreement (1-7)
- Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
- Age in years

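Scores on a depression inventory are usually treated as interval data. The total is a sum of item ratings, so differences between scores are treated as consistent and scores can be compared meaningfully. A score of 0 means no symptoms were endorsed on the inventory rather than a true absence of depression, so the scale lacks a true zero and is not ratio data. Strictly speaking, the individual item ratings are ordinal.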

Response time in milliseconds is ratio data because it has a true zero point (0 milliseconds means no time passed). Since the differences between times are consistent, we can make valid ratio comparisons, such as one response taking twice as long or half as long as another.

Likert scale ratings are ordinal because they rank responses (e.g., from “Disagree” to “Agree”), but the distance between adjacent responses isn’t guaranteed to be equal. The order matters, but we can’t assume the difference between levels is the same.

Diagnostic categories are nominal because they are used to label and categorize individuals; when analyzing data, they are functionally similar to personality types or eye colors.

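Age in years is ratio data. It has a true zero point (birth), and the intervals between years are equal, so ratio statements are meaningful: a 40-year-old is twice as old as a 20-year-old.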

Question 2: Measurement Error

Referring to Chapter 3 (Measurement Errors in Psychological Research):

Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.

Random errors are unpredictable variations in measurement. They arise from small, uncontrollable factors, such as a participant’s sleep quality on the day of testing. These errors follow no pattern and can push a result above or below the true value. Systematic errors are consistent: they shift results in the same direction every time and stem from a flaw in the method, the equipment, or a bias. The main distinction is that increasing the sample size or the number of measurements reduces the impact of random error, because the errors tend to cancel out. Repeated measurement does not correct systematic error; that requires fixing the bias or the flaw in the equipment or technique.

Consider a memory experiment in which a researcher shows participants a series of images. Random error occurs when some participants recall fewer images because they were tired or momentarily distracted, or recall an image incorrectly because it reminded them of something similar. These errors happen by chance, follow no pattern, and tend to balance out over many trials. Systematic error occurs if the researcher always presents the images for too short a time, which makes it harder for all participants to process and remember them. Because this mistake affects everyone in the same way, it consistently lowers memory scores. Unlike random errors, systematic errors do not balance out and can lead to misleading results.
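The contrast can be illustrated with a small simulation. This is only a sketch under assumed values: the true score of 12 recalled images, the noise level, and the 3-image bias are all hypothetical numbers chosen for illustration.

```r
set.seed(42)                      # arbitrary seed so the sketch is reproducible

n_trials   <- 10000               # hypothetical number of repeated measurements
true_score <- 12                  # hypothetical "true" number of images recalled

# Random error: zero-mean noise (fatigue, distraction) added to each measurement
random_only <- true_score + rnorm(n_trials, mean = 0, sd = 2)

# Systematic error: the same kind of noise plus a constant downward shift of
# 3 images, standing in for a presentation time that is too short for everyone
with_bias <- true_score - 3 + rnorm(n_trials, mean = 0, sd = 2)

mean(random_only) - true_score    # close to 0: random error averages out
mean(with_bias) - true_score      # close to -3: the bias remains
```

Averaging more trials shrinks the first difference toward zero, while the second stays near the size of the bias no matter how many trials are collected.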

How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?

Measurement error can undermine the validity of conclusions about the relationship between stress and academic performance. Random error, such as noise introduced by a poorly written stress questionnaire, tends to weaken the observed correlation, while systematic error can bias it in one direction. Additionally, factors like health, family support, and school environment contribute to both stress and academic performance and should be considered before drawing conclusions. It is also essential to consider where the sample originated, because answers will differ depending on which school the students attend. To minimize these errors, researchers could use validated questionnaires, measure performance with more than grades alone, recruit a large sample composed of students from many schools and socioeconomic backgrounds, and supplement surveys with physiological data. In establishing a correlation between any two variables, researchers must also consider the impact of external variables.
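The weakening effect of random measurement error on a correlation can be sketched with simulated scores. All values below are assumptions made for illustration: standardized “true” stress and performance scores with a correlation of about -0.5, and measurement noise with a standard deviation of 1.

```r
set.seed(2024)                    # arbitrary seed for reproducibility
n <- 5000                         # hypothetical number of students

# Hypothetical error-free scores with a true correlation of about -0.5
true_stress <- rnorm(n)
true_perf   <- -0.5 * true_stress + rnorm(n, sd = sqrt(1 - 0.5^2))

# Observed scores = true scores + random measurement error
obs_stress <- true_stress + rnorm(n, sd = 1)
obs_perf   <- true_perf   + rnorm(n, sd = 1)

r_true <- cor(true_stress, true_perf)   # correlation without measurement error
r_obs  <- cor(obs_stress, obs_perf)     # correlation with noisy measures

c(true = r_true, observed = r_obs)      # observed value is closer to zero
```

With these assumed noise levels the observed correlation is roughly half the size of the true one, which is why more reliable measures give a more accurate picture of the relationship.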

Part 2: Descriptive Statistics and Basic Probability

Question 3: Descriptive Analysis

The code below creates a simulated dataset for a psychological experiment. Run the code chunk below without making any changes:

set.seed(123) 

n <- 50

data <- data.frame(
  participant_id = 1:n,
  reaction_time = rnorm(n, mean = 300, sd = 50),
  accuracy = rnorm(n, mean = 85, sd = 10),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  condition = sample(c("Control", "Experimental"), n, replace = TRUE),
  anxiety_pre = rnorm(n, mean = 25, sd = 8),
  anxiety_post = NA
)

data$anxiety_post <- ifelse(
  data$condition == "Experimental",
  data$anxiety_pre - rnorm(n, mean = 8, sd = 3),
  data$anxiety_pre - rnorm(n, mean = 3, sd = 2)  
)

data$anxiety_post <- pmax(data$anxiety_post, 0)

data$reaction_time[sample(1:n, 3)] <- NA
data$accuracy[sample(1:n, 2)] <- NA

head(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397

Now, perform the following computations:

Calculate the mean, median, standard deviation, minimum, and maximum for reaction time and accuracy, grouped by condition (hint: use the psych package).

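library(psych)  # provides describeBy()

# Descriptive statistics for reaction_time and accuracy, split by condition
describeBy(data[, c("reaction_time", "accuracy")], group = data$condition, mat = TRUE, digits = 2)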

##                item       group1 vars  n   mean    sd median trimmed   mad
## reaction_time1    1      Control    1 30 301.40 48.54 299.68  300.42 55.38
## reaction_time2    2 Experimental    1 17 295.75 38.37 288.49  295.61 43.74
## accuracy1         3      Control    2 29  85.49  9.86  85.53   85.68  8.77
## accuracy2         4 Experimental    2 19  88.06  8.20  88.32   87.76  9.86
##                   min    max  range  skew kurtosis   se
## reaction_time1 201.67 408.45 206.78  0.14    -0.66 8.86
## reaction_time2 215.67 377.94 162.27  0.00    -0.27 9.31
## accuracy1       61.91 105.50  43.59 -0.15    -0.35 1.83
## accuracy2       74.28 106.87  32.59  0.45    -0.45 1.88

Using dplyr and piping, create a new variable anxiety_change that represents the difference between pre and post anxiety scores (pre minus post). Then calculate the mean anxiety change for each condition.

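library(dplyr)  # provides %>%, mutate(), group_by(), and summarize()

# anxiety_change is computed inside the pipe only; it is not stored in data
data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post) %>%
  group_by(condition) %>%
  summarize(mean_anxiety_change = mean(anxiety_change, na.rm = TRUE))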

## # A tibble: 2 × 2
##   condition    mean_anxiety_change
##   <chr>                      <dbl>
## 1 Control                     3.79
## 2 Experimental                8.64

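The mean anxiety change (pre minus post) is 3.79 for the Control group and 8.64 for the Experimental group. Because positive values indicate a decrease in anxiety, anxiety dropped by a larger amount, on average, in the Experimental condition.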

Question 4: Probability Calculations

Using the concepts from Chapter 4 (Descriptive Statistics and Basic Probability in Psychological Research):

If reaction times in a cognitive task are normally distributed with a mean of 350ms and a standard deviation of 75ms:
1. What is the probability that a randomly selected participant will have a reaction time greater than 450ms?
2. What is the probability that a participant will have a reaction time between 300ms and 400ms?

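mean_reaction_time <- 350  # population mean (ms)
sd_reaction_time <- 75     # population standard deviation (ms)

# z-score for 450 ms, kept for reference: (450 - 350) / 75 = 1.33
z_450 <- (450 - mean_reaction_time) / sd_reaction_time

# P(X > 450): one minus the cumulative probability up to 450 ms
probability_greater_than_450 <- 1 - pnorm(450, mean = mean_reaction_time, sd = sd_reaction_time)

# P(300 < X < 400): difference between two cumulative probabilities
probability_between_300_and_400 <- pnorm(400, mean = mean_reaction_time, sd = sd_reaction_time) - pnorm(300, mean = mean_reaction_time, sd = sd_reaction_time)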

probability_greater_than_450

## [1] 0.09121122

probability_between_300_and_400

## [1] 0.4950149

The probability that a randomly selected participant will have a reaction time greater than 450ms is approximately 0.09 (about 9.1%). The probability that a participant will have a reaction time between 300ms and 400ms is approximately 0.50 (about 49.5%).
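The same probabilities can be reached by first converting each cutoff to a z-score and then using the standard normal distribution. The sketch below is one way to do that; the names mu, sigma, p_above_450, and p_between are chosen here for illustration.

```r
mu    <- 350   # mean reaction time in ms (from the question)
sigma <- 75    # standard deviation in ms (from the question)

# Convert the raw cutoffs to z-scores: z = (x - mu) / sigma
z_450 <- (450 - mu) / sigma   # about  1.33
z_300 <- (300 - mu) / sigma   # about -0.67
z_400 <- (400 - mu) / sigma   # about  0.67

# P(X > 450): upper tail of the standard normal distribution
p_above_450 <- pnorm(z_450, lower.tail = FALSE)

# P(300 < X < 400): difference between two cumulative probabilities
p_between <- pnorm(z_400) - pnorm(z_300)

round(c(p_above_450, p_between), 4)   # 0.0912 0.4950
```

Because 300 and 400 sit two-thirds of a standard deviation on either side of the mean, the second probability is the area within about 0.67 standard deviations of the center, which is just under one half.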

Part 3: Data Cleaning and Manipulation

Question 5: Data Cleaning with dplyr

Using the dataset created in Part 2, perform the following data cleaning and manipulation tasks:

Remove all rows with missing values and create a new dataset called clean_data.

clean_data <- na.omit(data)

Create a new variable performance_category that categorizes participants based on their accuracy:
- “High” if accuracy is greater than or equal to 90
- “Medium” if accuracy is between 70 and 90
- “Low” if accuracy is less than 70

data <- data %>%
  mutate(
    performance_category = case_when(
      accuracy >= 90 ~ "High",        
      accuracy >= 70 & accuracy < 90 ~ "Medium",
      accuracy < 70 ~ "Low"          
    )
  )
head(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post performance_category
## 1     29.05312               Medium
## 2     19.21510               Medium
## 3     20.45306               Medium
## 4     13.75199                 High
## 5     17.84736               Medium
## 6     19.93397                 High

Filter the dataset to include only participants in the Experimental condition with reaction times faster than the overall mean reaction time.

mean_reaction_time <- mean(clean_data$reaction_time, na.rm = TRUE)

filtered_data <- clean_data %>%
  filter(condition == "Experimental" & reaction_time < mean_reaction_time)
head(filtered_data)

##   participant_id reaction_time accuracy gender    condition anxiety_pre
## 1              2      288.4911 84.71453 Female Experimental    31.15234
## 2             15      272.2079 74.28209 Female Experimental    21.66514
## 3             24      263.5554 77.90799   Male Experimental    42.02762
## 4             26      215.6653 95.25571 Female Experimental    16.23203
## 5             32      285.2464 88.85280 Female Experimental    35.10548
## 6             38      296.9044 89.35181 Female Experimental    25.67790
##   anxiety_post
## 1     19.21510
## 2     16.64266
## 3     31.90485
## 4      8.05278
## 5     27.37644
## 6     16.42095

I removed rows with missing values to create clean_data and added a new variable, performance_category, to data that classifies participants as High, Medium, or Low based on their accuracy scores. Finally, I filtered clean_data to include only participants in the Experimental condition whose reaction times were faster than the mean reaction time, providing a more structured dataset ready for further analysis.
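As an alternative to case_when(), the same three categories could be built with base R’s cut(). This is a sketch on made-up accuracy values, not the exam data; it shows how the boundary values 70 and 90 and missing values are handled.

```r
# Hypothetical accuracy scores, including both boundary values and a missing value
accuracy <- c(65.2, 70, 84.7, 90, 98.7, NA)

performance_category <- cut(
  accuracy,
  breaks = c(-Inf, 70, 90, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE   # intervals are [lower, upper): 70 -> "Medium", 90 -> "High"
)

as.character(performance_category)
# "Low" "Medium" "Medium" "High" "High" NA
```

Setting right = FALSE matters here: with the default, a score of exactly 90 would fall in the “Medium” bin, which would not match the “greater than or equal to 90” rule in the question.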

Part 4: Visualization and Correlation Analysis

Question 6: Correlation Analysis with the psych Package

Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:

Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
Use the psych package’s corPlot() function to create a correlation plot.
Interpret the resulting plot by addressing:
- Which variables appear to be strongly correlated?
- Are there any surprising relationships?
- How might these correlations inform further research in psychology?

# anxiety_change was never stored in the dataset, so it is recomputed here
numeric_data <- clean_data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post) %>%
  select(reaction_time, accuracy, anxiety_pre, anxiety_post, anxiety_change)

# Lower-triangle correlation plot with coefficients printed in each cell
corPlot(cor(numeric_data, use = "pairwise.complete.obs"),
        numbers = TRUE,
        upper = FALSE,
        main = "Correlation Plot of Key Variables")

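Anxiety_pre and anxiety_post are the most highly correlated, which is expected: anxiety_post was generated by subtracting a relatively small random amount from anxiety_pre, so participants who started with high anxiety tend to end with high anxiety. Accuracy and reaction time have a very low correlation. This may seem surprising, since speed and accuracy often trade off in real tasks, but here the two variables were simulated independently. In real data, a strong pre-post correlation would suggest accounting for baseline anxiety when evaluating an intervention, and a weak relationship between speed and accuracy would be worth examining separately within each condition.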
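A short simulation can show why these two patterns follow from the way the dataset was built. The sketch below reuses the distribution parameters from the Part 2 code but is a simplification: it uses a larger sample, applies only the Control-condition reduction to everyone, and skips the floor at zero and the missing values.

```r
set.seed(1)                       # arbitrary seed for reproducibility
n <- 1000                         # larger than the exam's n = 50, for stable estimates

# Same distributions as the Part 2 simulation (Control-condition reduction only)
anxiety_pre   <- rnorm(n, mean = 25, sd = 8)
anxiety_post  <- anxiety_pre - rnorm(n, mean = 3, sd = 2)
reaction_time <- rnorm(n, mean = 300, sd = 50)
accuracy      <- rnorm(n, mean = 85, sd = 10)

# Post scores are pre scores minus a small random amount, so they track closely
r_pre_post <- cor(anxiety_pre, anxiety_post)

# Reaction time and accuracy are drawn independently, so r should be near 0
r_rt_acc <- cor(reaction_time, accuracy)

round(c(pre_post = r_pre_post, rt_accuracy = r_rt_acc), 2)
```

The pre-post correlation is high because the spread of the baseline scores (SD of 8) is much larger than the spread of the subtracted amount (SD of 2), so most of the variation in post scores is inherited from the pre scores.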