Take-Home Midterm Exam: Introductory Psychological Statistics

Replace “Your Name” with your actual name.

Instructions

Please complete this exam on your own. Include your R code, interpretations, and answers within this document.

Part 1: Types of Data and Measurement Errors

Question 1: Data Types in Psychological Research

Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:

Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.

Nominal data are categorical data with no inherent order; they support only counting and frequency analysis. Examples include gender, ethnicity, and diagnostic categories. Ordinal data are also categorical but DO have a meaningful order, although the intervals between categories are not necessarily equal; examples include education levels and Likert scales. Ordinal data allow comparisons of rank (greater than or less than) but not arithmetic operations. Interval data are numerical data with equal intervals but NO true zero point, such as IQ scores and temperature in Celsius. Ratio data are numerical data that DO HAVE both equal intervals and a true zero point, such as reaction time and weight. Both interval and ratio data allow addition and subtraction, but only ratio data allow meaningful ratios (for example, “twice as long”).

For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
- Scores on a depression inventory (0-63)
- Response time in milliseconds
- Likert scale ratings of agreement (1-7)
- Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
- Age in years

Scores on a depression inventory (0-63): Best treated as interval data. The total score is numerical, and researchers conventionally treat the steps between scores as roughly equal so that means and standard deviations can be computed. It is not ratio data, because a score of 0 is simply the lowest possible questionnaire score rather than a true absence of depression, so a score of 40 does not mean “twice as depressed” as a score of 20. Strictly speaking, summed questionnaire items are ordinal, since a higher score indicates more severe depression without guaranteeing equal steps.
Response time in milliseconds: Ratio data because response time is numerical, has equal intervals, and has a true zero point. This allows meaningful ratios, such as identifying a response time that is half as long (twice as fast) as another.
Likert scale ratings of agreement (1-7): Ordinal data because the ratings have a meaningful order, but the intervals between them are not necessarily equal. For example, the step from 6 to 7 may not reflect the same change in agreement as the step from 3 to 4.
Diagnostic categories: Nominal data because diagnostic categories have no inherent order and permit only counting and frequency analysis. For instance, “ADHD,” “anxiety disorder,” and “no diagnosis” cannot be ranked from lowest to highest.
Age in years: Ratio data because age in years is numerical, with a true zero point and equal intervals. For instance, a 32-year-old is twice as old as a 16-year-old.
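
As a rough illustration, the four levels of measurement can be mirrored by R data types. The variable names and values below are hypothetical, made up for this sketch; they are not part of the exam dataset.

```r
# Hypothetical values for illustration only (not from the exam dataset)

# Nominal: unordered categories -> plain factor
diagnosis <- factor(c("ADHD", "anxiety disorder", "no diagnosis", "ADHD"))

# Ordinal: ordered categories -> ordered factor
agreement <- factor(c(2, 7, 5, 7), levels = 1:7, ordered = TRUE)

# Interval: equal steps, no true zero -> numeric; differences are meaningful
temp_celsius <- c(10, 20)

# Ratio: equal steps and a true zero -> numeric; ratios are meaningful
reaction_ms <- c(250, 500)

table(diagnosis)                  # frequencies are the appropriate summary for nominal data
agreement[2] > agreement[1]       # order comparisons work for ordered factors
median(as.integer(agreement))     # a median rank is defensible for ordinal data
diff(temp_celsius)                # a difference in degrees is meaningful
reaction_ms[2] / reaction_ms[1]   # a ratio of durations is meaningful
```

Storing ordinal variables as ordered factors has the side benefit that R refuses arithmetic such as `mean(agreement)`, which matches the restriction described above.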

Question 2: Measurement Error

Referring to Chapter 3 (Measurement Errors in Psychological Research):

Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.

Random error consists of unpredictable fluctuations in measurement that reduce reliability but do not systematically bias the results. For example, in a memory experiment, random error may occur if some participants are momentarily distracted by other stimuli, so their recall scores vary unpredictably. Systematic error, on the other hand, consists of consistent and predictable deviations in measurement that bias the results in a specific direction. For example, in a memory experiment, experimenter bias may occur if the experimenter guides the participants’ answers, which consistently inflates their recall scores. Another key difference is that the effect of random error can be reduced by increasing the sample size, whereas systematic error cannot.
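
The sample-size point can be sketched with a small simulation. All numbers here (a true recall score of 12 words, noise with a standard deviation of 3, a bias of 2 words) are assumptions chosen for illustration, and `simulate_mean` is a hypothetical helper.

```r
# Simulated memory experiment (all numbers are assumptions for illustration)
set.seed(1)
true_recall <- 12   # hypothetical true number of words recalled

simulate_mean <- function(n, bias = 0, noise_sd = 3) {
  # observed score = true score + systematic bias + random error
  observed <- true_recall + bias + rnorm(n, mean = 0, sd = noise_sd)
  mean(observed)
}

# Random error only: the sample mean approaches the true score as n grows
random_small <- simulate_mean(n = 10)
random_large <- simulate_mean(n = 10000)

# Systematic error (for example, an experimenter whose hints add about 2 words):
# the bias remains no matter how many participants are tested
biased_large <- simulate_mean(n = 10000, bias = 2)

round(c(random_small = random_small,
        random_large = random_large,
        biased_large = biased_large), 2)
```

With the large sample, the random-error mean should sit very close to 12, while the biased mean should sit close to 14 rather than 12.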

How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?

Measurement error can threaten the validity of a study examining the relationship between stress and academic performance in two ways. Random error lowers reliability: if stress or performance is measured inconsistently, the measure cannot accurately capture what it is intended to measure, and the observed relationship between the two variables tends to be weakened. Systematic error biases the results in a particular direction, for example if students consistently under-report their stress. Researchers can minimize these errors by using validated and well-calibrated instruments, standardizing testing procedures, taking repeated measurements, increasing the sample size (which reduces the impact of random error), and checking their data carefully during analysis.
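
One way random measurement error can distort such a study is by weakening the observed correlation. The simulation below is only a sketch: the true correlation of about -0.5 and the two error levels are assumed values, not estimates from any real study.

```r
# Simulated illustration (assumed effect size and error levels)
set.seed(42)
n <- 5000
true_stress <- rnorm(n)                                          # latent stress level
performance <- -0.5 * true_stress + rnorm(n, sd = sqrt(0.75))    # true r is about -0.5

# A reliable stress measure adds little random error; an unreliable one adds a lot
stress_reliable   <- true_stress + rnorm(n, sd = 0.2)
stress_unreliable <- true_stress + rnorm(n, sd = 1.5)

r_reliable   <- cor(stress_reliable, performance)
r_unreliable <- cor(stress_unreliable, performance)

round(c(reliable = r_reliable, unreliable = r_unreliable), 2)
```

The correlation based on the noisier measure should come out noticeably closer to zero, even though the underlying relationship is identical in both cases.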

Part 2: Descriptive Statistics and Basic Probability

Question 3: Descriptive Analysis

The code below creates a simulated dataset for a psychological experiment. Run the below code chunk without making any changes:

# Create a simulated dataset
set.seed(123)  # For reproducibility

# Number of participants
n <- 50

# Create the data frame
data <- data.frame(
  participant_id = 1:n,
  reaction_time = rnorm(n, mean = 300, sd = 50),
  accuracy = rnorm(n, mean = 85, sd = 10),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  condition = sample(c("Control", "Experimental"), n, replace = TRUE),
  anxiety_pre = rnorm(n, mean = 25, sd = 8),
  anxiety_post = NA  # We'll fill this in based on condition
)

# Make the experimental condition reduce anxiety more than control
data$anxiety_post <- ifelse(
  data$condition == "Experimental",
  data$anxiety_pre - rnorm(n, mean = 8, sd = 3),  # Larger reduction
  data$anxiety_pre - rnorm(n, mean = 3, sd = 2)   # Smaller reduction
)

# Ensure anxiety doesn't go below 0
data$anxiety_post <- pmax(data$anxiety_post, 0)

# Add some missing values for realism
data$reaction_time[sample(1:n, 3)] <- NA
data$accuracy[sample(1:n, 2)] <- NA

# View the first few rows of the dataset
head(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397

Now, perform the following computations:

Calculate the mean, median, standard deviation, minimum, and maximum for reaction time and accuracy, grouped by condition (hint: use the psych package).

library(psych)

# Calculate descriptive statistics of reaction_time grouped by condition
describeBy(data$reaction_time, data$condition)

## 
##  Descriptive statistics by group 
## group: Control
##    vars  n  mean    sd median trimmed   mad    min    max  range skew kurtosis
## X1    1 30 301.4 48.54 299.68  300.42 55.38 201.67 408.45 206.78 0.14    -0.66
##      se
## X1 8.86
## ------------------------------------------------------------ 
## group: Experimental
##    vars  n   mean    sd median trimmed   mad    min    max  range skew kurtosis
## X1    1 17 295.75 38.37 288.49  295.61 43.74 215.67 377.94 162.27    0    -0.27
##      se
## X1 9.31

describeBy(data$accuracy, data$condition)

## 
##  Descriptive statistics by group 
## group: Control
##    vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
## X1    1 29 85.49 9.86  85.53   85.68 8.77 61.91 105.5 43.59 -0.15    -0.35 1.83
## ------------------------------------------------------------ 
## group: Experimental
##    vars  n  mean  sd median trimmed  mad   min    max range skew kurtosis   se
## X1    1 19 88.06 8.2  88.32   87.76 9.86 74.28 106.87 32.59 0.45    -0.45 1.88
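
As a cross-check on the describeBy() output, the five requested statistics can also be computed in base R. This is a minimal sketch on a small hypothetical data frame (`toy`, with made-up values), and `five_stats` is a helper defined here, not a function from any package.

```r
# Small hypothetical data frame standing in for `data` (values are made up)
toy <- data.frame(
  condition     = c("Control", "Control", "Control",
                    "Experimental", "Experimental", "Experimental"),
  reaction_time = c(300, 320, NA, 280, 290, 310)
)

# The five requested statistics, skipping missing values
five_stats <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# Split the scores by condition, summarize each group, one row per condition
stats_by_condition <- t(sapply(split(toy$reaction_time, toy$condition), five_stats))
round(stats_by_condition, 2)
```

The same call could be pointed at `data$reaction_time` and `data$condition`, or at `data$accuracy`, to compare against the tables above.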

Using dplyr and piping, create a new variable anxiety_change that represents the difference between pre and post anxiety scores (pre minus post). Then calculate the mean anxiety change for each condition.

library(dplyr)
head(data)

##   participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1              1      271.9762  87.53319 Female      Control    31.30191
## 2              2      288.4911  84.71453 Female Experimental    31.15234
## 3              3      377.9354  84.57130 Female Experimental    27.65762
## 4              4      303.5254  98.68602   Male      Control    16.93299
## 5              5      306.4644  82.74229 Female      Control    24.04438
## 6              6      385.7532 100.16471 Female      Control    22.75684
##   anxiety_post
## 1     29.05312
## 2     19.21510
## 3     20.45306
## 4     13.75199
## 5     17.84736
## 6     19.93397

# Create new variable named anxiety_change
data <- data %>%
  mutate(anxiety_change = anxiety_pre - anxiety_post)

# Calculate mean of anxiety change for each condition
data %>%
  group_by(condition) %>%
  summarize(mean_anxiety_change = mean(anxiety_change, na.rm = TRUE))

## # A tibble: 2 × 2
##   condition    mean_anxiety_change
##   <chr>                      <dbl>
## 1 Control                     3.79
## 2 Experimental                8.64

The mean anxiety change (pre minus post) was 3.79 for the control group and 8.64 for the experimental group, so anxiety decreased more in the experimental condition.

Question 4: Probability Calculations

Using the concepts from Chapter 4 (Descriptive Statistics and Basic Probability in Psychological Research):

If reaction times in a cognitive task are normally distributed with a mean of 350ms and a standard deviation of 75ms:
1. What is the probability that a randomly selected participant will have a reaction time greater than 450ms?
2. What is the probability that a participant will have a reaction time between 300ms and 400ms?

# (a) Probability of reaction time greater than 450ms
1 - pnorm(450, mean = 350, sd = 75)

## [1] 0.09121122

# (b) Probability of reaction time between 300ms and 400ms
pnorm(400, mean = 350, sd = 75) - pnorm(300, mean = 350, sd = 75)

## [1] 0.4950149

a. The probability that a randomly selected participant will have a reaction time greater than 450ms is about 0.091, or 9.1%.

b. The probability that a randomly selected participant will have a reaction time between 300ms and 400ms is about 0.495, or roughly a 50% chance.
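
The same probabilities can be reached by first converting the cut-offs to z-scores, which makes the link to the standard normal distribution explicit. This sketch uses only the mean and standard deviation given in the question; the variable names are my own.

```r
mu    <- 350   # mean reaction time in ms (from the question)
sigma <- 75    # standard deviation in ms (from the question)

# Convert raw cut-offs to z-scores: z = (x - mu) / sigma
z_450 <- (450 - mu) / sigma
z_300 <- (300 - mu) / sigma
z_400 <- (400 - mu) / sigma

# (a) Upper-tail probability beyond 450 ms
p_above_450 <- pnorm(z_450, lower.tail = FALSE)

# (b) Probability between 300 ms and 400 ms
p_between <- pnorm(z_400) - pnorm(z_300)

round(c(z_450 = z_450, p_above_450 = p_above_450, p_between = p_between), 4)
```

Standardizing should reproduce the values computed above with the `mean` and `sd` arguments (about 0.0912 and 0.4950), since both approaches describe the same normal curve.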

Part 3: Data Cleaning and Manipulation

Question 5: Data Cleaning with dplyr

Using the dataset created in Part 2, perform the following data cleaning and manipulation tasks:

Remove all rows with missing values and create a new dataset called clean_data.

# Remove rows with any NA values and create new dataset clean_data
clean_data <- data %>%
  na.omit()

# Print in a separate step: clean_data does not exist until the
# assignment above has finished, so it cannot be used inside the pipeline
print(clean_data)
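
In a pipeline of the form `x <- data %>% step() %>% print(x)`, R evaluates the whole right-hand side before `x` is created, so `x` cannot be referred to inside the pipeline itself. A minimal sketch of the assign-then-inspect pattern, using a small hypothetical data frame (`toy`, with made-up values):

```r
# Hypothetical mini data frame with two incomplete rows
toy <- data.frame(
  participant_id = 1:5,
  reaction_time  = c(271.9, NA, 377.9, 303.5, 306.4),
  accuracy       = c(87.5, 84.7, NA, 98.6, 82.7)
)

# Finish the assignment first ...
toy_clean <- na.omit(toy)

# ... then inspect the result in a separate step
print(toy_clean)

# Sanity checks: how many rows were dropped, and are any NAs left?
rows_dropped <- nrow(toy) - nrow(toy_clean)
rows_dropped
anyNA(toy_clean)
```

Comparing row counts before and after is a quick way to confirm that only the incomplete rows were removed.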

Create a new variable performance_category that categorizes participants based on their accuracy:
- “High” if accuracy is greater than or equal to 90
- “Medium” if accuracy is between 70 and 90
- “Low” if accuracy is less than 70

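# Create new variable performance_category based on accuracy.
# This pipeline starts from data, so rows with missing values are kept;
# participants with missing accuracy get NA for performance_category.
clean_data <- data %>%
  mutate(
    performance_category = case_when(
      accuracy >= 90 ~ "High",
      accuracy >= 70 & accuracy < 90 ~ "Medium",
      accuracy < 70 ~ "Low"
    )
  )
print(clean_data)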

##    participant_id reaction_time  accuracy gender    condition anxiety_pre
## 1               1      271.9762  87.53319 Female      Control    31.30191
## 2               2      288.4911  84.71453 Female Experimental    31.15234
## 3               3      377.9354  84.57130 Female Experimental    27.65762
## 4               4      303.5254  98.68602   Male      Control    16.93299
## 5               5      306.4644  82.74229 Female      Control    24.04438
## 6               6      385.7532 100.16471 Female      Control    22.75684
## 7               7      323.0458  69.51247 Female      Control    29.50392
## 8               8      236.7469  90.84614   Male      Control    22.02049
## 9               9            NA  86.23854 Female Experimental    32.81579
## 10             10      277.7169  87.15942 Female      Control    22.00335
## 11             11            NA  88.79639 Female Experimental    33.42169
## 12             12      317.9907  79.97677   Male Experimental    16.60658
## 13             13      320.0386  81.66793   Male Experimental    14.91876
## 14             14      305.5341  74.81425 Female      Control    50.92832
## 15             15      272.2079  74.28209 Female Experimental    21.66514
## 16             16            NA  88.03529 Female      Control    27.38582
## 17             17      324.8925  89.48210 Female Experimental    30.09256
## 18             18      201.6691  85.53004   Male      Control    21.12975
## 19             19      335.0678  94.22267 Female      Control    29.13490
## 20             20      276.3604 105.50085   Male      Control    27.95172
## 21             21      246.6088  80.08969 Female      Control    23.27696
## 22             22      289.1013  61.90831   Male      Control    25.52234
## 23             23      248.6998  95.05739   Male      Control    24.72746
## 24             24      263.5554  77.90799   Male Experimental    42.02762
## 25             25      268.7480  78.11991 Female      Control    19.06931
## 26             26      215.6653  95.25571 Female Experimental    16.23203
## 27             27      341.8894  82.15227   Male      Control    25.30231
## 28             28      307.6687  72.79282   Male      Control    27.48385
## 29             29      243.0932  86.81303 Female      Control    28.49219
## 30             30      362.6907        NA   Male      Control    21.33308
## 31             31      321.3232  85.05764   Male Experimental    16.49339
## 32             32      285.2464  88.85280 Female Experimental    35.10548
## 33             33      344.7563  81.29340 Female      Control    22.20280
## 34             34      343.9067  91.44377   Male      Control    18.07590
## 35             35      341.0791  82.79513 Female      Control    23.10976
## 36             36      334.4320  88.31782 Female Experimental    23.42259
## 37             37      327.6959  95.96839 Female Experimental    33.87936
## 38             38      296.9044  89.35181 Female Experimental    25.67790
## 39             39      284.7019  81.74068 Female      Control    31.03243
## 40             40      280.9764  96.48808   Male Experimental    21.00566
## 41             41      265.2647  94.93504   Male      Control    26.71556
## 42             42      289.6041  90.48397 Female      Control    22.40251
## 43             43      236.7302        NA   Male      Control    25.75667
## 44             44      408.4478  78.72094 Female      Control    17.83709
## 45             45      360.3981  98.60652   Male      Control    14.51359
## 46             46      243.8446  78.99740   Male Experimental    40.97771
## 47             47      279.8558 106.87333   Male Experimental    29.80567
## 48             48      276.6672 100.32611 Female Experimental    14.98983
## 49             49      338.9983  82.64300 Female      Control    20.11067
## 50             50      295.8315  74.73579 Female      Control    15.51616
##    anxiety_post anxiety_change performance_category
## 1     29.053117     2.24879426               Medium
## 2     19.215099    11.93723893               Medium
## 3     20.453056     7.20456483               Medium
## 4     13.751994     3.18099329                 High
## 5     17.847362     6.19701754               Medium
## 6     19.933968     2.82286978                 High
## 7     24.342317     5.16159899                  Low
## 8     17.758982     4.26150823                 High
## 9     19.863065    12.95272240               Medium
## 10    22.069157    -0.06580401               Medium
## 11    25.063956     8.35773571               Medium
## 12     7.875522     8.73106229               Medium
## 13     3.221330    11.69742764               Medium
## 14    45.327922     5.60039736               Medium
## 15    16.642661     5.02247855               Medium
## 16    21.290659     6.09516212               Medium
## 17    23.416047     6.67651035               Medium
## 18    21.642810    -0.51305479               Medium
## 19    26.912456     2.22244027                 High
## 20    24.773302     3.17841445                 High
## 21    18.586930     4.69002601               Medium
## 22    20.597288     4.92505594                  Low
## 23    20.358843     4.36861886                 High
## 24    31.904850    10.12276506               Medium
## 25    14.370025     4.69928609               Medium
## 26     8.052780     8.17924981                 High
## 27    21.952702     3.34960540               Medium
## 28    24.334744     3.14910235               Medium
## 29    24.635854     3.85633353               Medium
## 30    18.283727     3.04934997                 <NA>
## 31     2.627509    13.86588190               Medium
## 32    27.376440     7.72904122               Medium
## 33    18.430744     3.77205314               Medium
## 34    15.607200     2.46869675                 High
## 35    19.873474     3.23628902               Medium
## 36    19.373641     4.04895160               Medium
## 37    26.428138     7.45122383                 High
## 38    16.420951     9.25694721               Medium
## 39    28.470531     2.56189924               Medium
## 40    15.350273     5.65539054                 High
## 41    21.378795     5.33676775                 High
## 42    17.294151     5.10836205                 High
## 43    20.466142     5.29052622                 <NA>
## 44    15.992029     1.84506400               Medium
## 45     7.508622     7.00496546                 High
## 46    27.270622    13.70708547               Medium
## 47    22.108595     7.69707534                 High
## 48    11.069351     3.92047789                 High
## 49    17.068705     3.04196717               Medium
## 50    10.016330     5.49982914               Medium
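
The same three categories can also be produced with base R’s cut(), which makes the boundary handling explicit. This is a sketch on hypothetical accuracy scores chosen to sit on and around the cut-offs; it is an alternative to case_when(), not the method used above.

```r
# Hypothetical accuracy scores, including boundary values and a missing value
accuracy <- c(61.9, 70, 89.99, 90, 105.5, NA)

# right = FALSE makes each interval closed on the left: [70, 90) and [90, Inf)
performance_category <- cut(
  accuracy,
  breaks = c(-Inf, 70, 90, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE
)

data.frame(accuracy, performance_category)
```

Testing scores that fall exactly on 70 and 90 is a simple way to confirm that “greater than or equal to 90” lands in High and that exactly 70 lands in Medium.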

Filter the dataset to include only participants in the Experimental condition with reaction times faster than the overall mean reaction time.

# Filter the dataset to keep only Experimental participants whose reaction
# times are faster (smaller) than the overall mean reaction time
data %>%
  filter(condition == "Experimental" & reaction_time < mean(reaction_time, na.rm = TRUE))
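
Because a faster response is a shorter time, “faster than the overall mean” translates to a value below the mean. A minimal base R sketch on a hypothetical data frame (`toy`, with made-up reaction times):

```r
# Hypothetical reaction times in ms (made-up values)
toy <- data.frame(
  participant_id = 1:6,
  condition      = c("Experimental", "Control", "Experimental",
                     "Experimental", "Control", "Experimental"),
  reaction_time  = c(250, 260, 340, NA, 330, 290)
)

# Overall mean across both conditions, ignoring missing values
overall_mean <- mean(toy$reaction_time, na.rm = TRUE)

# "Faster" means a shorter time, so the comparison is <, not >
faster_experimental <- subset(
  toy,
  condition == "Experimental" & reaction_time < overall_mean
)

overall_mean
faster_experimental$participant_id
```

Note that the mean is taken over all participants, not just the Experimental group, and that rows with a missing reaction time drop out of the filtered result.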

First, I removed all rows with missing (NA) values by piping the dataset into na.omit() and stored the result as a new dataset called clean_data, which I then printed. Next, I created a new variable called performance_category that categorizes participants by accuracy: greater than or equal to 90 (High), between 70 and 90 (Medium), and less than 70 (Low). I did this by using the mutate function to create the new variable and the case_when function to assign the categories, and then printed the data including performance_category. Lastly, I used the filter function to keep only participants in the Experimental condition whose reaction times were faster than the overall mean reaction time.

Part 4: Visualization and Correlation Analysis

Question 6: Correlation Analysis with the psych Package

Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:

Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
Use the psych package’s corPlot() function to create a correlation plot.
Interpret the resulting plot by addressing:
- Which variables appear to be strongly correlated?
- Are there any surprising relationships?
- How might these correlations inform further research in psychology?

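# Select numeric variables from the dataset and create corPlot. Hint: first,
# with dplyr create a new dataset that selects only the numeric variables
# (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if
# you created it).
numeric_data <- clean_data %>%
  select(reaction_time, accuracy, anxiety_pre, anxiety_post, anxiety_change)

# Plot in a separate step so numeric_data holds the selected columns rather
# than the return value of corPlot(). If R reports "figure margins too large",
# the plotting device is usually too small; enlarging the Plots pane or the
# chunk's fig.width/fig.height typically helps.
corPlot(numeric_data, upper = FALSE)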

## Error in plot.new(): figure margins too large

The variables that appear to be most strongly correlated are anxiety_pre and anxiety_post: as anxiety_pre increases, so does anxiety_post. This is expected, because each post score was created by subtracting a reduction from the pre score. A surprising relationship was the one between anxiety_pre and accuracy, which differed from what I expected. These correlations may inform further research in psychology by highlighting the relationship between accuracy and anxiety change that the plot suggests; following up on it would help researchers understand how anxiety may affect overall performance accuracy.
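
Reading the exact coefficients alongside the plot makes an interpretation like this easier to check. The sketch below builds a simulated stand-in for the numeric columns (the structure mirrors the exam data, but the values are generated here and are not the exam results) and prints a rounded correlation matrix.

```r
# Simulated stand-in data (assumed parameters; not the exam results)
set.seed(2024)
n <- 200
anxiety_pre    <- rnorm(n, mean = 25, sd = 8)
anxiety_post   <- anxiety_pre - rnorm(n, mean = 5, sd = 3)   # post is built from pre
anxiety_change <- anxiety_pre - anxiety_post
accuracy       <- rnorm(n, mean = 85, sd = 10)               # generated independently
accuracy[c(3, 17)] <- NA                                     # a few missing values

numeric_toy <- data.frame(accuracy, anxiety_pre, anxiety_post, anxiety_change)

# Pairwise-complete correlations, rounded for reading alongside a plot
cor_matrix <- round(cor(numeric_toy, use = "pairwise.complete.obs"), 2)
cor_matrix
```

In data built this way, the pre/post correlation is high by construction, while accuracy is unrelated to the anxiety variables, so any sizable correlation with accuracy would reflect sampling noise rather than a real effect.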

Part 5: Reflection and Application

Question 7: Reflection

Reflect on how the statistical concepts and R techniques covered in this course apply to psychological research:

Describe a specific research question in psychology that interests you. What type of data would you collect, what statistical analyses would be appropriate, and what potential measurement errors might you need to address?
How has learning R for data analysis changed your understanding of psychological statistics? What do you see as the biggest advantages and challenges of using R compared to other statistical software?

A specific research question in psychology that interests me is: did the COVID-19 pandemic increase social anxiety in individuals? I would collect data on lifestyle habits, anxiety symptoms, and change in anxiety over time, using a standardized questionnaire such as the GAD-7 to measure anxiety symptoms. Appropriate statistical analyses would include descriptive statistics and correlational analysis; because this design is not an experiment, any causal conclusions would need to be drawn cautiously. Potential measurement errors I may need to address include observer bias and random error. Observer bias may occur if, for example, I had preconceived notions about social anxiety and the pandemic, which could skew how I collect or interpret the data. Random error, such as fluctuations in participants’ emotional state on the day of testing, could also make the results less reliable and therefore less valid.
Learning R for data analysis has changed my understanding of psychological statistics drastically because R lets me analyze data much more quickly than calculating everything manually. The biggest advantage of R, in my opinion, is being able to create graphs and tables in a matter of seconds, which makes it easy to visualize our analyses. On the other hand, the biggest challenge for me is understanding the different programming concepts. I get information overload very easily, and R sometimes confuses me because many of its functions look similar but do different things.

Submission Instructions:

Be sure to knit your document to HTML format, checking that all content is displayed correctly before submission. Publish your assignment to RPubs and submit the URL to Canvas.