Take-Home Midterm Exam: Introductory Psychological Statistics

Instructions

Please complete this exam on your own. Include your R code, interpretations, and answers within this document.

Part 1: Types of Data and Measurement Errors

Question 1: Data Types in Psychological Research

Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:

Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.

Nominal data is a form of data that is named and separated into categories with no inherent order; the values are labels only and cannot be ranked. An example from psychological research is a study of gender differences in hobbies, where the nominal variable is gender (recorded here as male or female).

Ordinal data has a meaningful order, but the distances between values are not equal or not known. An example is a mental health assessment in which each item asks you to choose among “Strongly agree, Agree slightly, Neutral, Disagree slightly, or Strongly disagree”.

Interval data is numeric data with known, equal intervals between values but no true zero. An example is an IQ test, where the difference between scores (like 125 and 130, or 130 and 135) is consistent, but a score of 0 would not mean zero intelligence.

Last, ratio data is numeric data that has equal intervals and a true zero point. An example is an experiment that tracks how many errors participants make while completing instructed tasks. If a participant makes 0 errors, they have truly made zero errors. If one participant makes 2 errors and another makes 4, the second has made exactly twice as many errors as the first.

For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
- Scores on a depression inventory (0-63)
- Response time in milliseconds
- Likert scale ratings of agreement (1-7)
- Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
- Age in years

Scores on a depression inventory (0-63): This is an example of interval data because the scores are numerical and are treated as having equal intervals between values, but a score of 0 would not mean a total absence of depression.

Response time in milliseconds: This is an example of ratio data because response time is numerical with equal intervals, and 0 would truly mean that zero time passed.

Likert scale ratings of agreement (1-7): This is an example of ordinal data, because the numbers on the Likert scale represent ranked responses, like “Strongly disagree”, but the difference between adjacent responses is not necessarily equal or clear.

Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis): This is an example of nominal data because the values are categories of diagnosis. There is no order, and one category doesn’t mean “more” than another.
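
The distinction between these levels can also be made explicit when the variables are stored in R. The following is a minimal sketch with made-up values (the variable names and data are illustrative assumptions, not part of the exam dataset): an unordered factor for nominal data, an ordered factor for ordinal data, and a plain numeric vector for ratio data.

```r
# Nominal: categories with no order (hypothetical diagnoses)
diagnosis <- factor(c("ADHD", "anxiety disorder", "no diagnosis", "ADHD"))

# Ordinal: categories with a meaningful order but unknown spacing
agreement <- factor(
  c("Agree slightly", "Neutral", "Strongly agree"),
  levels = c("Strongly disagree", "Disagree slightly", "Neutral",
             "Agree slightly", "Strongly agree"),
  ordered = TRUE
)

# Ratio: numeric values with equal intervals and a true zero
response_time_ms <- c(412, 380, 455)

# Only the ordinal variable is stored as an ordered factor
is.ordered(diagnosis)
is.ordered(agreement)

# Ranking comparisons are meaningful for ordinal data
agreement[1] > agreement[2]

# Ratios are meaningful for ratio data
response_time_ms[3] / response_time_ms[2]
```

Ordered factors support comparisons such as > and <, but not arithmetic, which mirrors the idea that ordinal data can be ranked while the distances between ranks are unknown.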

Question 2: Measurement Error

Referring to Chapter 3 (Measurement Errors in Psychological Research):

Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.

A random error is an error caused by unpredictable and uncontrollable factors that follow no pattern. It adds noise to the results of the study. An example is a participant in a memory recall experiment being distracted by something like a beeping fire alarm in the room. It is an unaccounted-for factor that makes that participant’s recall score less accurate. A systematic error is a consistent, repeated error caused by a flaw in the study’s measurement system. It pushes the results in the same direction every time. An example is a study that measures recall time in seconds on a handheld stopwatch that always runs a few milliseconds late. This flaw in the measurement system biases every measurement in the study.
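
The difference can be illustrated with a small simulation. This is only a sketch under assumed values: the true recall score of 10 words, the noise level, and the bias of 1.5 are all hypothetical numbers chosen for illustration.

```r
set.seed(1)  # for reproducibility

n <- 1000
true_recall <- 10  # hypothetical true number of words recalled

# Random error: noise centered on zero, with no consistent direction
random_only <- true_recall + rnorm(n, mean = 0, sd = 2)

# Systematic error: the same kind of noise plus a constant bias of +1.5
biased <- true_recall + 1.5 + rnorm(n, mean = 0, sd = 2)

# Average error in each set of measurements
mean(random_only) - true_recall  # close to 0: random errors tend to cancel out
mean(biased) - true_recall       # close to 1.5: the bias does not cancel out
```

Averaging over many measurements reduces the impact of random error, but it does nothing to remove a systematic error.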

How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?

In a study examining the relationship between stress and academic performance, measurement errors could affect the results in different ways. In my experience, surveys on this topic are emailed to the teacher, who distributes them during class for students to complete on the spot. Students are often distracted by peers or worry about others seeing their answers. This would be a random error that could affect the study. An example of a systematic error is framing questions in a black-and-white way, for instance by assessing only test anxiety and not other kinds of anxiety or their sources. One way researchers could minimize these errors is to make sure, or at least encourage, that participants complete the survey in a private and comfortable area, free of distractions. Additionally, carefully crafting questions with input from real people who experience test anxiety can help enhance the assessment’s accuracy.
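
One well-known consequence of random measurement error is that it weakens (attenuates) the observed correlation between two variables. The simulation below is a rough sketch of that effect; the effect size, sample size, and amount of noise are arbitrary assumptions chosen for illustration, not values from a real stress study.

```r
set.seed(42)  # for reproducibility

n <- 5000

# Hypothetical "true" stress levels and a performance score related to them
stress_true <- rnorm(n)
performance <- -0.5 * stress_true + rnorm(n, sd = sqrt(0.75))

# Stress as measured by a noisy survey (true score plus random error)
stress_measured <- stress_true + rnorm(n, sd = 1)

# Correlation using the true scores vs. the noisy measurements
r_true <- cor(stress_true, performance)
r_measured <- cor(stress_measured, performance)

round(c(true = r_true, measured = r_measured), 2)
```

Under these assumptions, the correlation based on the noisy measure is closer to zero than the correlation based on the true scores, so a study with unreliable measures may underestimate the real relationship.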

Part 2: Descriptive Statistics and Basic Probability

Question 3: Descriptive Analysis

The code below creates a simulated dataset for a psychological experiment. Run the below code chunk without making any changes:

# Create a simulated dataset
set.seed(123)  # For reproducibility

# Number of participants
n <- 50

# Create the data frame
data <- data.frame(
  participant_id = 1:n,
  reaction_time = rnorm(n, mean = 300, sd = 50),
  accuracy = rnorm(n, mean = 85, sd = 10),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  condition = sample(c("Control", "Experimental"), n, replace = TRUE),
  anxiety_pre = rnorm(n, mean = 25, sd = 8),
  anxiety_post = NA  # We'll fill this in based on condition
)

# Make the experimental condition reduce anxiety more than control
data$anxiety_post <- ifelse(
  data$condition == "Experimental",
  data$anxiety_pre - rnorm(n, mean = 8, sd = 3),  # Larger reduction
  data$anxiety_pre - rnorm(n, mean = 3, sd = 2)   # Smaller reduction
)

# Ensure anxiety doesn't go below 0
data$anxiety_post <- pmax(data$anxiety_post, 0)

# Add some missing values for realism
data$reaction_time[sample(1:n, 3)] <- NA
data$accuracy[sample(1:n, 2)] <- NA

# View the first few rows of the dataset
head(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397

Now, perform the following computations*:

Calculate the mean, median, standard deviation, minimum, and maximum for reaction time and accuracy, grouped by condition (hint: use the psych package).

# Summary statistics for reaction time and accuracy, grouped by condition
library(psych)
describeBy(
  data[, c("reaction_time", "accuracy")],  # select columns
  group = data$condition,                  # group by 'condition'
  mat = TRUE                               # return as a data frame
)

Using dplyr and piping, create a new variable anxiety_change that represents the difference between pre and post anxiety scores (pre minus post). Then calculate the mean anxiety change for each condition.

# Load dplyr for mutate(), group_by(), summarise(), and the %>% pipe
library(dplyr)

# Create anxiety_change (pre minus post)
data <- data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post)

# Mean anxiety change for each condition
data %>%
  group_by(condition) %>%
  summarise(mean_anxiety_change = mean(anxiety_change, na.rm = TRUE))

## # A tibble: 2 × 2
##   condition    mean_anxiety_change
##   <chr>                      <dbl>
## 1 Control                     3.79
## 2 Experimental                8.64

Question 4: Probability Calculations

Using the concepts from Chapter 4 (Descriptive Statistics and Basic Probability in Psychological Research):

If reaction times in a cognitive task are normally distributed with a mean of 350ms and a standard deviation of 75ms:
1. What is the probability that a randomly selected participant will have a reaction time greater than 450ms?
2. What is the probability that a participant will have a reaction time between 300ms and 400ms?

# Given values
mean_rt <- 350
sd_rt <- 75

# 1. Probability that reaction time > 450ms
p_greater_450 <- 1 - pnorm(450, mean = mean_rt, sd = sd_rt)

# 2. Probability that reaction time is between 300ms and 400ms
p_between_300_400 <- pnorm(400, mean = mean_rt, sd = sd_rt) -
  pnorm(300, mean = mean_rt, sd = sd_rt)

# Print the results
p_greater_450
p_between_300_400

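The probability that a randomly selected participant has a reaction time greater than 450ms is about 0.091 (roughly 9%). The probability that a participant has a reaction time between 300ms and 400ms is about 0.495 (roughly 50%).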
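
Both probabilities can be cross-checked by converting the cutoffs to z-scores and using the standard normal distribution. The sketch below assumes only the mean and standard deviation given in the question; the variable names are illustrative.

```r
# Parameters given in the question
mean_rt <- 350
sd_rt <- 75

# Convert each cutoff to a z-score: z = (x - mean) / sd
z_450 <- (450 - mean_rt) / sd_rt
z_300 <- (300 - mean_rt) / sd_rt
z_400 <- (400 - mean_rt) / sd_rt

# Upper-tail probability for question 1
p_upper <- pnorm(z_450, lower.tail = FALSE)

# Area between the two z-scores for question 2
p_middle <- pnorm(z_400) - pnorm(z_300)

round(c(z_450 = z_450, p_upper = p_upper, p_middle = p_middle), 3)
```

Working with z-scores gives the same probabilities as passing mean and sd to pnorm() directly, and it makes clear that 450ms lies about 1.33 standard deviations above the mean.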


Part 3: Data Cleaning and Manipulation

Question 5: Data Cleaning with dplyr

Using the dataset created in Part 2, perform the following data cleaning and manipulation tasks:

Remove all rows with missing values and create a new dataset called clean_data.

# Load the dplyr package (run install.packages("dplyr") once beforehand if it is not installed)
library(dplyr)

# Remove rows with missing values and create a new dataset called clean_data
clean_data <- na.omit(data)

# View the first few rows of the cleaned dataset
head(clean_data)

Create a new variable performance_category that categorizes participants based on their accuracy:
- “High” if accuracy is greater than or equal to 90
- “Medium” if accuracy is between 70 and 90
- “Low” if accuracy is less than 70

library(dplyr)

clean_data <- na.omit(data)  # assuming 'data' exists from part 2

# Create the performance_category variable based on accuracy
clean_data <- clean_data %>%
  mutate(performance_category = case_when(
    accuracy >= 90 ~ "High",           # High if accuracy is >= 90
    accuracy >= 70 & accuracy < 90 ~ "Medium",  # Medium if accuracy is between 70 and 90
    accuracy < 70 ~ "Low"              # Low if accuracy is < 70
  ))

head(clean_data)
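
The same three-way split can also be written with base R's cut() function. This is an alternative sketch, not the approach used above, and the accuracy values are made up to show how the boundary cases (exactly 70 and exactly 90) are handled.

```r
# Hypothetical accuracy scores, including the two boundary values
accuracy <- c(65.2, 70, 84.7, 90, 98.7)

# right = FALSE makes each interval include its lower bound:
# [-Inf, 70) is "Low", [70, 90) is "Medium", [90, Inf) is "High"
performance_category <- cut(
  accuracy,
  breaks = c(-Inf, 70, 90, Inf),
  labels = c("Low", "Medium", "High"),
  right = FALSE
)

data.frame(accuracy, performance_category)
```

Unlike case_when(), cut() returns a factor whose levels are already in Low, Medium, High order, which can be convenient for tables and plots.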

Filter the dataset to include only participants in the Experimental condition with reaction times faster than the overall mean reaction time.

library(dplyr)

# Calculate the overall mean reaction time, excluding NA values
mean_reaction_time <- mean(clean_data$reaction_time, na.rm = TRUE)

# Filter the dataset: Experimental condition and reaction time faster than the mean
filtered_data <- clean_data %>%
  filter(condition == "Experimental" & reaction_time < mean_reaction_time)

# View the filtered data
head(filtered_data)

I removed missing values from the dataset using na.omit(), created a new variable (performance_category), and filtered the data by condition and reaction time.

Part 4: Visualization and Correlation Analysis

Question 6: Correlation Analysis with the psych Package

Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:

Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
Use the psych package’s corPlot() function to create a correlation plot.
Interpret the resulting plot by addressing:
- Which variables appear to be strongly correlated?
- Are there any surprising relationships?
- How might these correlations inform further research in psychology?

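# Load the psych package (run install.packages("psych") once beforehand if it is not installed)
library(psych)

# Create anxiety_change in case it was not created earlier
data$anxiety_change <- data$anxiety_pre - data$anxiety_post

# Select the numeric variables
cor_data <- data[, c("reaction_time", "accuracy", "anxiety_pre", "anxiety_post", "anxiety_change")]

# Create the correlation plot
corPlot(cor_data, upper = TRUE, main = "Correlation Plot for Selected Variables")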

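anxiety_pre and anxiety_post appear to be positively correlated, and anxiety_change is negatively correlated with anxiety_post. A positive correlation between anxiety and reaction time would suggest that higher anxiety is associated with slower reaction times, although a correlation alone cannot show that one causes the other. Relationships like these could inform future experiments exploring how stress impacts cognitive performance.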
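
Part of this pattern follows from how the variables are built: because anxiety_change is computed as anxiety_pre minus anxiety_post, it is mathematically tied to both. The sketch below uses freshly simulated values (a simplified stand-in for the exam dataset, with an assumed average reduction of 5 points and no floor at zero) to show the numeric correlations behind such a plot.

```r
set.seed(123)  # for reproducibility

n <- 1000

# Simplified, hypothetical version of the anxiety variables
anxiety_pre <- rnorm(n, mean = 25, sd = 8)
reduction <- rnorm(n, mean = 5, sd = 3)
anxiety_post <- anxiety_pre - reduction
anxiety_change <- anxiety_pre - anxiety_post

# Numeric correlation matrix for the three variables
r <- cor(data.frame(anxiety_pre, anxiety_post, anxiety_change))
round(r, 2)
```

In this simplified setup, the pre and post scores are strongly positively correlated because post is mostly determined by pre, while the change score is negatively correlated with post because a larger reduction directly lowers the post score.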

Part 5: Reflection and Application

Question 7: Reflection

Reflect on how the statistical concepts and R techniques covered in this course apply to psychological research:

Describe a specific research question in psychology that interests you. What type of data would you collect, what statistical analyses would be appropriate, and what potential measurement errors might you need to address?

A specific research question in psychology that interests me is: What factors make people more likely to lie in everyday social situations? To study this, I would collect quantitative data from self-report questionnaires, scales that measure personality traits, and a lab experiment that puts participants in a situation where lying benefits them so that their behavior can be observed. Potential measurement errors could include bias from participants who lie to appear better socially or who behave differently because they know they are being studied.

How has learning R for data analysis changed your understanding of psychological statistics? What do you see as the biggest advantages and challenges of using R compared to other statistical software?

Learning R for data analysis has changed my understanding of psychological studies because it allows me to apply statistical ideas in a real-world way. When running chunks, I can see results right away, which gives me a better and more thorough understanding of the data. The biggest advantage is the customization of the workspace, and the biggest disadvantage is running into errors in the code and having to correct them myself.

Submission Instructions:

Be sure to knit your document to HTML format, checking that all content is displayed correctly before submission. Publish your assignment to RPubs and submit the URL to Canvas.