Replace “Your Name” with your actual name.

Instructions

Please complete this exam on your own. Include your R code, interpretations, and answers within this document.

Part 1: Types of Data and Measurement Errors

Question 1: Data Types in Psychological Research

Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:

  1. Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.

The key differences between nominal, ordinal, interval, and ratio data come down to whether the values can be ordered, whether the intervals between values are equal, and whether there is a true zero. Nominal data has no order and no numerical meaning; for example, hair color (brown, blonde, red, black). Ordinal data can be ordered, but the distances between categories are not necessarily equal; for example, rating how satisfied you are with an establishment (very dissatisfied, dissatisfied, neutral, satisfied). Interval data is ordered with equal intervals between points but no true zero; for example, on a thermometer the difference between 40 and 50 degrees is the same as the difference between 20 and 30 degrees, so the jumps stay consistent. Finally, ratio data has all the properties of interval data plus a true zero; for example, the number of items someone remembers on a final exam, where remembering nothing is a true zero.
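As an optional illustration (not required by the question, and the variables are made up), here is a minimal sketch of how each level of measurement is typically represented in R:

# Nominal: unordered factor (labels only, no order)
hair_color <- factor(c("brown", "blonde", "red", "black"))

# Ordinal: ordered factor (order matters, but spacing between levels does not)
satisfaction <- factor(
  c("neutral", "satisfied", "dissatisfied"),
  levels = c("very dissatisfied", "dissatisfied", "neutral", "satisfied"),
  ordered = TRUE
)

# Interval: numeric with equal intervals but no true zero (e.g., temperature in Fahrenheit)
temperature_f <- c(20, 30, 40, 50)

# Ratio: numeric with equal intervals and a true zero (e.g., items recalled on a test)
items_recalled <- c(0, 3, 7, 12)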

  2. For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
    • Scores on a depression inventory (0-63)
    • Response time in milliseconds
    • Likert scale ratings of agreement (1-7)
    • Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
    • Age in years

The depression inventory score is interval: it is ordered with roughly equal intervals, but a score of 0 does not represent a true absence of the construct. Response time in milliseconds is ratio: it is ordered, has equal intervals, and has a true zero (0 ms would mean no response time at all). The Likert ratings are ordinal: the values 1 through 7 are ordered, but the intervals between them are not necessarily equal or consistent. Diagnostic categories are nominal: they are unordered labels with no numerical value. Finally, age in years is ratio: the interval between 30 and 40 years is the same as between 10 and 20 years, and there is a true zero (birth).

Question 2: Measurement Error

Referring to Chapter 3 (Measurement Errors in Psychological Research):

  1. Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.

Systematic and random error both cause problems in psychological studies: systematic error reduces validity, while random error reduces reliability. In a memory experiment, random error might occur if participants are occasionally distracted by background noise or visual clutter, which adds unpredictable variability to their recall scores and lowers reliability. Systematic error might occur if participants are supposed to have 60 seconds to memorize a word list but the timer consistently runs for 67 seconds; every score is shifted in the same direction, which threatens the validity of the results.
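A toy simulation (my own illustration, with made-up numbers) makes the distinction concrete: random error adds noise that averages out across participants, while systematic error shifts every score in the same direction:

set.seed(1)
true_recall <- rep(10, 100)  # every participant truly recalls 10 words

# Random error: unpredictable noise around the true score (hurts reliability)
observed_random <- true_recall + rnorm(100, mean = 0, sd = 2)

# Systematic error: extra study time inflates every score by about 1 word (hurts validity)
observed_systematic <- true_recall + 1 + rnorm(100, mean = 0, sd = 2)

mean(observed_random)      # close to the true mean of 10
mean(observed_systematic)  # biased upward, around 11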

  2. How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?

Measurement error can undermine the validity of a study of the relationship between stress and academic performance, depending on how the constructs are measured. For example, if participants report their stress and academic performance on a survey, the wording needs to be realistic and resonate with participants rather than relying on extreme or exaggerated phrasing. To minimize measurement error, researchers should use a reliable, validated measurement tool, ensure participants experience the same conditions when participating in the study, and make sure all researchers follow the same standardized procedures.
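One concrete check on the "reliable tool" point is internal-consistency reliability, which the psych package can estimate with Cronbach's alpha. This is only a sketch with randomly generated (hypothetical) items so that it runs on its own; real questionnaire items would be correlated and give a meaningful alpha:

library(psych)

# Hypothetical example: five stress-questionnaire items answered by 100 participants
set.seed(42)
stress_items <- data.frame(
  item1 = sample(1:5, 100, replace = TRUE),
  item2 = sample(1:5, 100, replace = TRUE),
  item3 = sample(1:5, 100, replace = TRUE),
  item4 = sample(1:5, 100, replace = TRUE),
  item5 = sample(1:5, 100, replace = TRUE)
)

# Cronbach's alpha: with real, correlated items, raw_alpha around .80 or higher
# is usually taken as acceptable reliability (these random items will show low alpha)
alpha(stress_items)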


Part 2: Descriptive Statistics and Basic Probability

Question 3: Descriptive Analysis

The code below creates a simulated dataset for a psychological experiment. Run the code chunk below without making any changes:

# Create a simulated dataset
set.seed(123)  # For reproducibility

# Number of participants
n <- 50

# Create the data frame
data <- data.frame(
  participant_id = 1:n,
  reaction_time = rnorm(n, mean = 300, sd = 50),
  accuracy = rnorm(n, mean = 85, sd = 10),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  condition = sample(c("Control", "Experimental"), n, replace = TRUE),
  anxiety_pre = rnorm(n, mean = 25, sd = 8),
  anxiety_post = NA  # We'll fill this in based on condition
)

# Make the experimental condition reduce anxiety more than control
data$anxiety_post <- ifelse(
  data$condition == "Experimental",
  data$anxiety_pre - rnorm(n, mean = 8, sd = 3),  # Larger reduction
  data$anxiety_pre - rnorm(n, mean = 3, sd = 2)   # Smaller reduction
)

# Ensure anxiety doesn't go below 0
data$anxiety_post <- pmax(data$anxiety_post, 0)

# Add some missing values for realism
data$reaction_time[sample(1:n, 3)] <- NA
data$accuracy[sample(1:n, 2)] <- NA

# View the first few rows of the dataset
head(data)
##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397

Now, perform the following computations:

  1. Calculate the mean, median, standard deviation, minimum, and maximum for reaction time and accuracy, grouped by condition (hint: use the psych package).
data$reaction_time
##  [1] 271.9762 288.4911 377.9354 303.5254 306.4644 385.7532 323.0458 236.7469
##  [9]       NA 277.7169       NA 317.9907 320.0386 305.5341 272.2079       NA
## [17] 324.8925 201.6691 335.0678 276.3604 246.6088 289.1013 248.6998 263.5554
## [25] 268.7480 215.6653 341.8894 307.6687 243.0932 362.6907 321.3232 285.2464
## [33] 344.7563 343.9067 341.0791 334.4320 327.6959 296.9044 284.7019 280.9764
## [41] 265.2647 289.6041 236.7302 408.4478 360.3981 243.8446 279.8558 276.6672
## [49] 338.9983 295.8315
# Calculate the mean
mean(data$reaction_time, na.rm = TRUE)
## [1] 299.3575
# Calculate the median
median(data$reaction_time, na.rm = TRUE)
## [1] 295.8315
# Calculate the mode
get_mode <- function(x) {
  uniqv <- unique(x)
  uniqv[which.max(tabulate(match(x, uniqv)))]
}

get_mode(na.omit(data$reaction_time))
## [1] 271.9762
#Calculate variance
var(data$reaction_time, na.rm = TRUE)
## [1] 2005.04
#Calculate standard deviation
sd(data$reaction_time, na.rm = TRUE)
## [1] 44.77768
# Calculate minimum
min(data$reaction_time, na.rm = TRUE)
## [1] 201.6691
# Calculate Maximum
max(data$reaction_time, na.rm = TRUE)
## [1] 408.4478
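The question asks for these statistics grouped by condition; one way to get the mean, median, sd, min, and max for each group in a single call (assuming the psych package is loaded, as hinted) is psych::describeBy():

# Descriptive statistics for reaction time and accuracy, split by condition
describeBy(data[, c("reaction_time", "accuracy")], group = data$condition)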
  2. Using dplyr and piping, create a new variable anxiety_change that represents the difference between pre and post anxiety scores (pre minus post). Then calculate the mean anxiety change for each condition.
# Creating a new variable using mutate()
data <- data %>% 
  mutate(anxiety_change = anxiety_pre - anxiety_post)
 
# Calculate the mean anxiety change for each condition
data %>% group_by(condition) %>% 
summarize(mean_anxiety_change = mean(anxiety_change, na.rm = TRUE))
## # A tibble: 2 × 2
##   condition    mean_anxiety_change
##   <chr>                      <dbl>
## 1 Control                     3.79
## 2 Experimental                8.64

I created a new variable called ‘anxiety_change’ that represents the difference between pre- and post-anxiety scores. The mean anxiety change for the control group is 3.79, meaning anxiety dropped by about 3.8 units from pre to post. The experimental group’s mean change is 8.64, which shows a noticeably larger reduction in anxiety.

Question 4: Probability Calculations

Using the concepts from Chapter 4 (Descriptive Statistics and Basic Probability in Psychological Research):

  1. If reaction times in a cognitive task are normally distributed with a mean of 350ms and a standard deviation of 75ms:
    1. What is the probability that a randomly selected participant will have a reaction time greater than 450ms?
    2. What is the probability that a participant will have a reaction time between 300ms and 400ms?
# Define parameter
mean <- 350
sd <- 75

# Calculate the probability of a reaction time greater than 450ms
prob_more_than <- 1 - pnorm(450, mean, sd)
print(paste("Probability of a reaction time greater than 450ms:", prob_more_than))
## [1] "Probability of a reaction time greater than 450ms: 0.0912112197258679"
# Calculate the probability of a reaction time between 300ms and 400ms
prob_between_300_and_400 <- pnorm(400, mean, sd) - pnorm(300, mean, sd)
print(paste("Probability of a reaction time between 300ms and 400ms:", round(prob_between_300_and_400, 4)))
## [1] "Probability of a reaction time between 300ms and 400ms: 0.495"

With a mean reaction time of 350ms and a standard deviation of 75ms, the probability of a reaction time greater than 450ms is about 0.09, and the probability of a reaction time between 300ms and 400ms is about 0.495. A reaction time in the 300-400ms range is therefore much more likely than one above 450ms.


Part 3: Data Cleaning and Manipulation

Question 5: Data Cleaning with dplyr

Using the dataset created in Part 2, perform the following data cleaning and manipulation tasks:

  1. Remove all rows with missing values and create a new dataset called clean_data.
# Remove rows with NA values and create 'clean_data'
clean_data <- data %>% na.omit()
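As a quick sanity check (my own addition, not required by the question), we can confirm how many incomplete rows were removed:

# Number of rows that contained at least one missing value and were dropped
nrow(data) - nrow(clean_data)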
  2. Create a new variable performance_category that categorizes participants based on their accuracy:
    • “High” if accuracy is greater than or equal to 90
    • “Medium” if accuracy is between 70 and 90
    • “Low” if accuracy is less than 70
data <- data %>% 
  mutate(performance_category = case_when(
    accuracy >= 90 ~ "High",
    accuracy >= 70 & accuracy < 90 ~ "Medium",
    accuracy < 70 ~ "Low",
    TRUE ~ NA_character_ ))
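To verify the categorization worked as intended (an optional check), we can count how many participants fall into each category:

# Count participants in each performance category (NA = rows with missing accuracy)
table(data$performance_category, useNA = "ifany")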
  3. Filter the dataset to include only participants in the Experimental condition with reaction times faster than the overall mean reaction time.
# Calculate the overall mean reaction time
mean_reaction_time <- mean(data$reaction_time, na.rm = TRUE)

#Filter the dataset for the Experimental condition and reaction times
filtered_data <- data %>% 
  filter(condition == "Experimental", reaction_time < mean_reaction_time)
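A brief optional check confirms that the filtered subset contains only Experimental participants with reaction times below the overall mean:

# Size of the filtered subset, plus checks that the filter worked as intended
nrow(filtered_data)
all(filtered_data$condition == "Experimental")
all(filtered_data$reaction_time < mean_reaction_time)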

I started by removing the missing data with na.omit() to create clean_data. I then added a new variable called ‘performance_category’ to categorize participants based on their accuracy scores. Next, I filtered the data to participants in the Experimental condition with reaction times faster than the overall average, which narrows the data for further analysis. This process ensures the data is organized, complete, and focused on the relevant variables.


Part 4: Visualization and Correlation Analysis

Question 6: Correlation Analysis with the psych Package

Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:

  1. Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
  2. Use the psych package’s corPlot() function to create a correlation plot.
  3. Interpret the resulting plot by addressing:
    • Which variables appear to be strongly correlated?
    • Are there any surprising relationships?
    • How might these correlations inform further research in psychology?
numeric_data <- data %>% 
  select(reaction_time, accuracy, anxiety_pre, anxiety_post, anxiety_change)

corPlot(cor(numeric_data, use= "complete.obs"))
## Error in plot.new(): figure margins too large
head(data)
##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post anxiety_change performance_category
## 1     29.05312       2.248794               Medium
## 2     19.21510      11.937239               Medium
## 3     20.45306       7.204565               Medium
## 4     13.75199       3.180993                 High
## 5     17.84736       6.197018               Medium
## 6     19.93397       2.822870                 High
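Note: the corPlot() call above failed with a “figure margins too large” error when knitting. As a fallback (assuming the psych package is loaded), the correlations can still be inspected numerically, or the plot can be retried after enlarging the graphics device:

# Numeric fallback: correlation matrix rounded to two decimals
round(cor(numeric_data, use = "complete.obs"), 2)

# psych's lowerCor() prints the lower triangle of the correlation matrix
lowerCor(numeric_data)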

There is a strong positive correlation between anxiety_pre and anxiety_post, showing that participants with higher pre-anxiety also tend to have higher post-anxiety. I also noticed a correlation between reaction_time and accuracy, where quicker reaction times were associated with lower accuracy. It was interesting to see how reaction time related to anxiety measured before and after the task. These correlations could inform future research by identifying how anxiety at different points in time relates to reaction time and performance.


Part 5: Reflection and Application

Question 7: Reflection

Reflect on how the statistical concepts and R techniques covered in this course apply to psychological research:

  1. Describe a specific research question in psychology that interests you. What type of data would you collect, what statistical analyses would be appropriate, and what potential measurement errors might you need to address?

  2. How has learning R for data analysis changed your understanding of psychological statistics? What do you see as the biggest advantages and challenges of using R compared to other statistical software?

I am interested in the root of anxiety and whether it is driven primarily by nature or nurture. I experience anxiety myself and wonder whether it comes from my genes or from my environment and experiences growing up. I would collect data such as the individual’s background, whether anxiety runs in their family, what their experiences were like growing up, what symptoms they have, and when those symptoms started. Collecting data on current stressors such as work, money, family, and illness could also help explain anxiety levels. For this particular study, psychological assessments administered by a professional would be the most conclusive option. Potential measurement errors include reliance on self-reported data and recall bias when participants do not remember events accurately.

Learning R has been challenging, but it is also very satisfying when you get the code right. It involves a lot of trial and error, but if I were running an experiment and had to code the analysis, it would be very interesting. The biggest advantages are that you can do most of the data analysis in one place, refer back to earlier steps as you move through the data, and personalize the output with some creative freedom. I think R is a great tool for analyzing and interpreting data because it does the heavy lifting once everything is entered correctly; even though it can be tedious, it ultimately saves time.


Submission Instructions:

Ensure to knit your document to HTML format, checking that all content is correctly displayed before submission. Publish your assignment to RPubs and submit the URL to canvas.