Take-Home Midterm Exam: Introductory Psychological Statistics

Replace “Your Name” with your actual name.

Instructions

Please complete this exam on your own. Include your R code, interpretations, and answers within this document.

Part 1: Types of Data and Measurement Errors

Question 1: Data Types in Psychological Research

Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:

Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.

Nominal data are categorical: the values are labels or names with no inherent order. An example is a survey that asks participants for their preferred type of therapy, with possible responses such as “CBT” or “DBT”. Ordinal data have ordered categories, but the distances between categories are not necessarily equal. An example is a questionnaire item that assesses agreement with a statement on a scale of strongly disagree, disagree, neutral, agree, and strongly agree. Interval data are numerical with equal spacing between values but no true zero. An example is room temperature in degrees Celsius recorded in a study of temperature and mood, because 0 °C does not mean an absence of temperature. Ratio data are numerical with equal spacing and a true zero. Examples are reaction time in milliseconds, where zero means no elapsed time, and the number of errors an individual makes on a standard times-table test.

For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
- Scores on a depression inventory (0-63)
- Response time in milliseconds
- Likert scale ratings of agreement (1-7)
- Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
- Age in years

Likert scale ratings of agreement (1-7) are ordinal because the responses are ordered categories, and the distance between adjacent points is not guaranteed to be equal. Diagnostic categories are nominal because they are names of disorders with no inherent order. Response time in milliseconds is ratio because it has equal intervals and a true zero. Scores on a depression inventory (0-63) are usually treated as interval because a score of zero does not necessarily mean a complete absence of depression. Age in years is ratio because zero is a true starting point (a newborn has not yet aged), so ratios are meaningful: a 20-year-old is twice as old as a 10-year-old.
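As a rough illustration of how these levels of measurement map onto R data types, the sketch below stores a nominal variable as an unordered factor, an ordinal variable as an ordered factor, and a ratio variable as a plain numeric vector. All values are hypothetical and invented for illustration; they are not taken from the exam dataset.

```r
# Hypothetical toy values, invented purely for illustration
diagnosis <- factor(c("ADHD", "anxiety disorder", "no diagnosis", "ADHD"))  # nominal

agreement <- factor(
  c("agree", "neutral", "strongly agree", "disagree"),
  levels = c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"),
  ordered = TRUE
)  # ordinal

response_time_ms <- c(412.5, 388.0, 501.2, 295.7)  # ratio: zero means no elapsed time

# Nominal: only category counts are meaningful
table(diagnosis)

# Ordinal: order comparisons are meaningful, arithmetic is not
agreement[1] > agreement[2]  # is "agree" above "neutral"?
max(agreement)

# Ratio: ratios between values are meaningful
response_time_ms[3] / response_time_ms[4]
```

Storing the Likert item as an ordered factor rather than a number is one design choice among several; it simply prevents accidental arithmetic on categories.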

Question 2: Measurement Error

Referring to Chapter 3 (Measurement Errors in Psychological Research):

Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.

Random errors are unpredictable and inconsistent: they push scores up on some occasions and down on others. An example would be a participant who is asked to recall words but performs worse than usual because of a distraction in the testing room on that trial. Systematic errors are consistent and bias scores in a predictable direction. An example would be a memorization period that is too short, so that participants consistently recall fewer words than they otherwise could.
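The distinction can be sketched with a small simulation. This is only an illustration under assumed values: the true recall score of 12 words, the noise level, and the 3-word bias are all hypothetical.

```r
set.seed(1)

n_participants <- 1000                   # hypothetical sample size
true_recall <- rep(12, n_participants)   # hypothetical true score: 12 words recalled

# Random error: noise centred on zero, different for every participant
random_error_scores <- true_recall + rnorm(n_participants, mean = 0, sd = 2)

# Systematic error: every score shifted down by the same amount
systematic_error_scores <- true_recall - 3

mean(random_error_scores)      # stays close to the true mean of 12
sd(random_error_scores)        # spread increases
mean(systematic_error_scores)  # biased: 9 instead of 12
sd(systematic_error_scores)    # no added spread
```

Random error leaves the average roughly intact but adds variability, whereas systematic error shifts the average without adding variability.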

How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?

Measurement error affects the validity of a study by adding inaccuracies to the scores, which can lead to incorrect conclusions and reduced credibility. For example, if a study examining the relationship between stress and academic performance uses consistently poorly worded questions, the data may show an inaccurate relationship between the two. Researchers could minimize these errors by using validated, standardized measures, pilot-testing their questions, and taking repeated measurements under consistent conditions.
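One way random measurement error distorts a relationship is by weakening the observed correlation. The simulation below is a minimal sketch of that effect, not an analysis of real data: the sample size, the true correlation of about -0.5, and the noise levels are all assumed values.

```r
set.seed(42)

n_sim <- 5000  # hypothetical sample size

# Hypothetical "true" scores with a correlation of about -0.5
true_stress <- rnorm(n_sim)
true_performance <- -0.5 * true_stress + rnorm(n_sim, sd = sqrt(0.75))

# Observed scores = true scores + random measurement error
noisy_stress <- true_stress + rnorm(n_sim, sd = 1)
noisy_performance <- true_performance + rnorm(n_sim, sd = 1)

r_true <- cor(true_stress, true_performance)
r_noisy <- cor(noisy_stress, noisy_performance)

# The noisy correlation is pulled toward zero
round(c(true = r_true, observed = r_noisy), 2)
```

More reliable measures shrink the added noise, which is why the observed correlation moves back toward the true one.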

Part 2: Descriptive Statistics and Basic Probability

Question 3: Descriptive Analysis

The code below creates a simulated dataset for a psychological experiment. Run the code chunk below without making any changes:

# Create a simulated dataset
set.seed(123)  # For reproducibility

# Number of participants
n <- 50

# Create the data frame
data <- data.frame(
  participant_id = 1:n,
  reaction_time = rnorm(n, mean = 300, sd = 50),
  accuracy = rnorm(n, mean = 85, sd = 10),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  condition = sample(c("Control", "Experimental"), n, replace = TRUE),
  anxiety_pre = rnorm(n, mean = 25, sd = 8),
  anxiety_post = NA  # We'll fill this in based on condition
)

# Make the experimental condition reduce anxiety more than control
data$anxiety_post <- ifelse(
  data$condition == "Experimental",
  data$anxiety_pre - rnorm(n, mean = 8, sd = 3),  # Larger reduction
  data$anxiety_pre - rnorm(n, mean = 3, sd = 2)   # Smaller reduction
)

# Ensure anxiety doesn't go below 0
data$anxiety_post <- pmax(data$anxiety_post, 0)

# Add some missing values for realism
data$reaction_time[sample(1:n, 3)] <- NA
data$accuracy[sample(1:n, 2)] <- NA

# View the first few rows of the dataset
head(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397

Now, perform the following computations:

Calculate the mean, median, standard deviation, minimum, and maximum for reaction time and accuracy, grouped by condition (hint: use the psych package).

library(psych)  # provides describeBy()

describeBy(data[, c("reaction_time", "accuracy")], data$condition)

## 
##  Descriptive statistics by group 
## group: Control
##               vars  n   mean    sd median trimmed   mad    min    max  range
## reaction_time    1 30 301.40 48.54 299.68  300.42 55.38 201.67 408.45 206.78
## accuracy         2 29  85.49  9.86  85.53   85.68  8.77  61.91 105.50  43.59
##                skew kurtosis   se
## reaction_time  0.14    -0.66 8.86
## accuracy      -0.15    -0.35 1.83
## ------------------------------------------------------------ 
## group: Experimental
##               vars  n   mean    sd median trimmed   mad    min    max  range
## reaction_time    1 17 295.75 38.37 288.49  295.61 43.74 215.67 377.94 162.27
## accuracy         2 19  88.06  8.20  88.32   87.76  9.86  74.28 106.87  32.59
##               skew kurtosis   se
## reaction_time 0.00    -0.27 9.31
## accuracy      0.45    -0.45 1.88

Using dplyr and piping, create a new variable anxiety_change that represents the difference between pre and post anxiety scores (pre minus post). Then calculate the mean anxiety change for each condition.

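library(dplyr)  # provides %>%, mutate(), group_by(), and summarize()

# Pre minus post: positive values mean anxiety decreased
data <- data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post)

data %>%
  group_by(condition) %>%
  summarize(mean_anxiety_change = mean(anxiety_change, na.rm = TRUE))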

## # A tibble: 2 × 2
##   condition    mean_anxiety_change
##   <chr>                      <dbl>
## 1 Control                     3.79
## 2 Experimental                8.64

Question 4: Probability Calculations

Using the concepts from Chapter 4 (Descriptive Statistics and Basic Probability in Psychological Research):

If reaction times in a cognitive task are normally distributed with a mean of 350ms and a standard deviation of 75ms:
1. What is the probability that a randomly selected participant will have a reaction time greater than 450ms?
2. What is the probability that a participant will have a reaction time between 300ms and 400ms?

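# Named rt_mean and rt_sd to avoid masking the base functions mean() and sd()
rt_mean <- 350
rt_sd <- 75

# P(X > 450) is the upper tail, so use lower.tail = FALSE
prob_more_than_450 <- pnorm(450, rt_mean, rt_sd, lower.tail = FALSE)
prob_more_than_450

## [1] 0.09121122

# P(300 < X < 400) is the difference between two lower-tail probabilities
prob_between_300_and_400 <- pnorm(400, rt_mean, rt_sd) - pnorm(300, rt_mean, rt_sd)
prob_between_300_and_400

## [1] 0.4950149

The probability that a randomly selected participant will have a reaction time greater than 450ms is about 0.0912. By default, pnorm() returns the probability of a value at or below 450ms (0.9087888), so the upper tail is 1 minus that value. The probability that a randomly selected participant will have a reaction time between 300ms and 400ms is about 0.4950.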
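The same probabilities can be checked by converting each cutoff to a z-score first. This is a minimal sketch using the mean (350ms) and standard deviation (75ms) stated in the question; the variable names are my own.

```r
rt_mean <- 350
rt_sd <- 75

# Convert each cutoff to a z-score: z = (x - mean) / sd
z_300 <- (300 - rt_mean) / rt_sd
z_400 <- (400 - rt_mean) / rt_sd
z_450 <- (450 - rt_mean) / rt_sd

# Upper tail: P(Z > z_450)
p_above_450 <- pnorm(z_450, lower.tail = FALSE)

# Equivalent calculation on the original scale: 1 - P(X <= 450)
p_above_450_alt <- 1 - pnorm(450, mean = rt_mean, sd = rt_sd)

# Area between two cutoffs: P(z_300 < Z < z_400)
p_between <- pnorm(z_400) - pnorm(z_300)

round(c(above_450 = p_above_450, between_300_400 = p_between), 4)
```

Because 450ms lies above the mean, the probability of exceeding it must be below 0.5, which is a quick sanity check on the direction of the tail.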

Part 3: Data Cleaning and Manipulation

Question 5: Data Cleaning with dplyr

Using the dataset created in Part 2, perform the following data cleaning and manipulation tasks:

Remove all rows with missing values and create a new dataset called clean_data.

# na.omit() drops every row that contains at least one NA
clean_data <- data %>%
  na.omit()
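Before dropping rows, it can help to check how many missing values each column has and how many rows will be lost. The sketch below shows one way to do that on a small hypothetical data frame (`toy_scores`), not on the exam dataset.

```r
# Hypothetical toy data with missing values, invented for illustration
toy_scores <- data.frame(
  participant_id = 1:5,
  reaction_time = c(300, NA, 320, 290, 310),
  accuracy = c(85, 90, NA, 70, 95)
)

# Count missing values per column
missing_per_column <- colSums(is.na(toy_scores))
missing_per_column

# Drop every row with at least one NA
toy_clean <- na.omit(toy_scores)

# Number of rows removed by the cleaning step
rows_removed <- nrow(toy_scores) - nrow(toy_clean)
rows_removed
```

Reporting the number of removed rows makes the cleaning step easier to describe and to reproduce.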

Create a new variable performance_category that categorizes participants based on their accuracy:
- “High” if accuracy is greater than or equal to 90
- “Medium” if accuracy is between 70 and 90
- “Low” if accuracy is less than 70

clean_data <- clean_data %>%
  mutate(performance_category = case_when(
    accuracy >= 90 ~ "High",
    accuracy >= 70 & accuracy < 90 ~ "Medium",
    accuracy < 70 ~ "Low"
  ))
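The same three-way categorization can be sketched in base R with cut(). The accuracy values below are hypothetical and chosen to sit on or near the boundaries; treating exactly 70 as “Medium” and exactly 90 as “High” is an assumption that matches the cutoffs used in the code above.

```r
# Hypothetical accuracy scores chosen to test the category boundaries
accuracy_scores <- c(65, 70, 89.9, 90, 101)

# right = FALSE makes each interval closed on the left:
# [-Inf, 70) = Low, [70, 90) = Medium, [90, Inf) = High
performance_category <- cut(
  accuracy_scores,
  breaks = c(-Inf, 70, 90, Inf),
  labels = c("Low", "Medium", "High"),
  right = FALSE
)

data.frame(accuracy_scores, performance_category)
```

Unlike case_when(), cut() returns a factor whose levels are already in Low, Medium, High order, which can be convenient for tables and plots.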

Filter the dataset to include only participants in the Experimental condition with reaction times faster than the overall mean reaction time.

mean_reaction_time <- mean(clean_data$reaction_time, na.rm = TRUE)

filtered_data <- clean_data %>%
  filter(condition == "Experimental" & reaction_time < mean_reaction_time)
head(filtered_data)

##   participant_id reaction_time accuracy gender    condition anxiety_pre
## 1              2      288.4911 84.71453 Female Experimental    31.15234
## 2             15      272.2079 74.28209 Female Experimental    21.66514
## 3             24      263.5554 77.90799   Male Experimental    42.02762
## 4             26      215.6653 95.25571 Female Experimental    16.23203
## 5             32      285.2464 88.85280 Female Experimental    35.10548
## 6             38      296.9044 89.35181 Female Experimental    25.67790
##   anxiety_post anxiety_change performance_category
## 1     19.21510      11.937239               Medium
## 2     16.64266       5.022479               Medium
## 3     31.90485      10.122765               Medium
## 4      8.05278       8.179250                 High
## 5     27.37644       7.729041               Medium
## 6     16.42095       9.256947               Medium

Write your answer(s) here describing your data cleaning process.

Part 4: Visualization and Correlation Analysis

Question 6: Correlation Analysis with the psych Package

Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:

Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
Use the psych package’s corPlot() function to create a correlation plot.
Interpret the resulting plot by addressing:
- Which variables appear to be strongly correlated?
- Are there any surprising relationships?
- How might these correlations inform further research in psychology?

# Your code here. Hint: first, create a new dataset that contains only the numeric variables (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
selected_data <- data[, c("reaction_time", "accuracy", "anxiety_pre", "anxiety_post", "anxiety_change")]

corPlot(selected_data, numbers=TRUE, main="Correlation Plot")

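The most strongly related variables are anxiety_pre and anxiety_post (0.90): participants who started with higher anxiety also tended to finish with higher anxiety. The correlation between anxiety_change and anxiety_pre (0.25) is positive but weak. The relationship between reaction_time and accuracy (-0.04) is close to zero, which surprised me because I would have assumed that higher accuracy goes along with faster reaction times. In this simulated dataset, however, the two variables were generated independently, so a near-zero correlation is expected. The correlation between anxiety_change and anxiety_post (-0.20) is also weak and negative: participants whose anxiety dropped more tended to have slightly lower post scores. For further research, the strong pre-post correlation suggests that baseline anxiety should be measured and accounted for when evaluating an anxiety-reducing intervention. The near-zero correlation between reaction time and accuracy suggests that, at least in these data, reaction time is not a useful indicator of accuracy.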
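The numbers behind a correlation plot can also be inspected as a plain matrix with base R. The sketch below uses freshly simulated, hypothetical data (not the exam dataset) in which reaction time is generated independently of anxiety, so the exact values will differ from those reported above.

```r
set.seed(2024)

n_sim <- 200  # hypothetical sample size

# Hypothetical data: post anxiety depends on pre anxiety, reaction time does not
anxiety_pre <- rnorm(n_sim, mean = 25, sd = 8)
anxiety_post <- anxiety_pre - rnorm(n_sim, mean = 5, sd = 3)
reaction_time <- rnorm(n_sim, mean = 300, sd = 50)
reaction_time[c(3, 17)] <- NA  # a couple of missing values

sim_data <- data.frame(
  reaction_time,
  anxiety_pre,
  anxiety_post,
  anxiety_change = anxiety_pre - anxiety_post
)

# Pairwise deletion keeps every pair of observations that is complete
cor_matrix <- round(cor(sim_data, use = "pairwise.complete.obs"), 2)
cor_matrix
```

Variables built from one another (pre and post anxiety) correlate strongly, while independently generated variables hover near zero.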

Part 5: Reflection and Application

Question 7: Reflection

Reflect on how the statistical concepts and R techniques covered in this course apply to psychological research:

Describe a specific research question in psychology that interests you. What type of data would you collect, what statistical analyses would be appropriate, and what potential measurement errors might you need to address?

How does the amount of screen time a child receives affect the number of hours slept per night? I would collect total screen time per day (in hours) as the independent variable. The dependent variable would be total hours of sleep per night. The control variables would be age, type of screen use, and bedtime environment. Descriptive statistics would be used to summarize average screen time and sleep duration. A correlation or simple linear regression would then show whether screen time predicts sleep duration. Some potential measurement errors may occur, such as self-report bias if the data were collected with a survey.
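As a rough sketch of what that analysis could look like, the code below simulates hypothetical screen time and sleep data and fits a simple linear regression. Every number here is an assumption made for illustration (sample size, the 0-6 hour range, and the built-in slope of -0.3 hours of sleep per hour of screen time); it is not a finding.

```r
set.seed(7)

n_children <- 300  # hypothetical sample size

# Hypothetical predictor: daily screen time between 0 and 6 hours
screen_time_hours <- runif(n_children, min = 0, max = 6)

# Hypothetical outcome: sleep drops 0.3 hours per hour of screen time, plus noise
sleep_hours <- 10 - 0.3 * screen_time_hours + rnorm(n_children, sd = 0.8)

# Descriptive statistics
summary(screen_time_hours)
summary(sleep_hours)

# Simple linear regression: does screen time predict sleep duration?
sleep_model <- lm(sleep_hours ~ screen_time_hours)
coef(sleep_model)

# Correlation between the two variables
screen_sleep_cor <- cor(screen_time_hours, sleep_hours)
screen_sleep_cor
```

In a real study, the control variables mentioned above (age, type of screen use, bedtime environment) could be added as further predictors in the lm() formula.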

How has learning R for data analysis changed your understanding of psychological statistics? What do you see as the biggest advantages and challenges of using R compared to other statistical software?

Learning R for data analysis has shown me that psychology involves far more coding than I expected. The biggest advantage of R seems to be reproducibility: because the analysis is written as a script, other researchers can share, rerun, and check it. The main challenge is that R is unforgiving of small mistakes, such as a typo or a missing parenthesis, and errors in a script can be hard to find.


Submission Instructions:

Be sure to knit your document to HTML format and check that all content is displayed correctly before submission. Publish your assignment to RPubs and submit the URL on Canvas.