First, I used pandas to load the data from the csv file. I called data.head() out of a habit I picked up in one of my R courses; this is the Python version. The head() function shows the first few rows of the data so that we know what we're working with. Data like this is often shared publicly (on GitHub, for example), so it's important that each student has a generic but unique (I suppose anonymized) ID so that their privacy is protected. I then labeled the columns for better data organization.
import pandas as pd
# Load data from the CSV file
file_path = '/Users/vivianhhuynh/Downloads/pset5/learningOutcomes.csv'
data = pd.read_csv(file_path)
# Display the first few rows of the data to peek at the csv
data.head()
## 5410951 activity2 53
## 0 5410952 activity2 79
## 1 5410953 activity1 207
## 2 5410954 activity1 179
## 3 5410955 activity2 68
## 4 5410956 activity1 135
# Rename columns for readability
data.columns = ['ID', 'Activity', 'Outcome']
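One thing the head() output hints at: the first record (5410951, activity2, 53) sits where the column names would normally be, which suggests the file has no header row and read_csv treated the first student as the header. Here is a minimal sketch of one way to keep that row. It assumes the file is comma-separated with no header line; the in-memory sample below is only a stand-in built from the first three records shown above, not the real file.

```python
import io
import pandas as pd

# Stand-in for learningOutcomes.csv (assumed: comma-separated, no header line)
sample_csv = io.StringIO(
    "5410951,activity2,53\n"
    "5410952,activity2,79\n"
    "5410953,activity1,207\n"
)

# header=None keeps the first record as data; names= supplies the column labels
sample = pd.read_csv(sample_csv, header=None, names=['ID', 'Activity', 'Outcome'])
print(sample.shape)
```

If the real file is read this way, the first student stays in the data, so the means and the p-value could shift slightly from the values reported here.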
Next, we wanted to find the observed difference in sample means of the learning outcomes between students given activity 1 (group A) and students given activity 2 (group B). With the columns labeled, we can filter rows on the strings in the Activity column that indicate whether a student was given activity1 or activity2. To calculate the observed difference, I used the mean() function to find the mean of each group, then subtracted the mean of group B from the mean of group A. We see below that the difference is -8.398, which means that on average, students in group B (given activity 2) scored about 8.398 points higher on their learning outcomes than those in group A (given activity 1). The difference is negative simply because group B's sample mean is larger and we subtract it from group A's. On its own this does not show that activity 2 is more effective; the sample difference could be due to chance, which is what the p-value in part b) helps assess.
# Calculate the mean outcomes for groups 1 and 2
mean_activity1 = data[data['Activity'] == 'activity1']['Outcome'].mean()
mean_activity2 = data[data['Activity'] == 'activity2']['Outcome'].mean()
# Calculate the observed difference in means
observed_difference = mean_activity1 - mean_activity2
observed_difference
## -8.398085385568976
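As an aside, the same calculation can be written with groupby, which computes every group's mean in one pass. This is only a sketch on made-up toy data (the `toy` DataFrame and its values are hypothetical), using the same column names and the same activity1 minus activity2 sign convention.

```python
import pandas as pd

# Toy data (made up) with the same column layout as the renamed DataFrame
toy = pd.DataFrame({
    'Activity': ['activity1', 'activity1', 'activity2', 'activity2'],
    'Outcome': [100, 120, 130, 150],
})

# One mean per activity, indexed by the activity label
group_means = toy.groupby('Activity')['Outcome'].mean()

# Same sign convention as above: activity1 minus activity2
toy_difference = group_means['activity1'] - group_means['activity2']
print(toy_difference)  # -30.0
```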
For part b), we wanted to calculate the p-value for the observed difference of means. That is, we wanted to know the probability that we could have sampled two groups of students and observed a difference of means as extreme as, or more extreme than, the one calculated from our data. To do this, I used a permutation test. Under the null hypothesis, the learning outcomes for students in group A and students in group B are identically distributed, so the group labels are interchangeable and can be reshuffled. Permutation tests also don't need a specific parametric form for that distribution, so this method is valid here and convenient since we have the raw data (the csv).
I combined the learning outcomes from both groups into a single array, then randomly shuffled it and split it into two new groups (with the same sizes as the original groups) 10,000 times, each time calculating the difference in means between the two groups. The p-value was computed as the proportion of permutations in which the absolute difference in means was greater than or equal to the absolute value of the observed difference.
This p-value estimates the probability of observing a difference in means at least as extreme as the observed one under the null hypothesis that there is no difference between the two activities.
import numpy as np
# Set the number of permutations
n_permutations = 10000
# Copy the outcomes into one array (a copy, so shuffling can't reorder the DataFrame column)
combined_outcomes = data['Outcome'].to_numpy().copy()
# Number of students in group A (activity1), computed once outside the loop
n_activity1 = (data['Activity'] == 'activity1').sum()
# Observed difference in means
observed_diff = observed_difference
# Initialize an array to store the differences from permutations
perm_diffs = np.zeros(n_permutations)
# Perform the permutation test
for i in range(n_permutations):
    # Shuffle the combined data in place
    np.random.shuffle(combined_outcomes)
    # Split the data into two groups with the original group sizes
    perm_group1 = combined_outcomes[:n_activity1]
    perm_group2 = combined_outcomes[n_activity1:]
    # Calculate diff in means for ith permutation
    perm_diffs[i] = np.mean(perm_group1) - np.mean(perm_group2)
# Calculate p-value
p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))
p_value
## 0.0406
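Because the shuffles are random, rerunning the code above gives a slightly different p-value each time. A reusable, seeded version of the same procedure might look like the sketch below. The helper name `permutation_p_value`, its `seed` parameter, and the toy inputs are my own assumptions for illustration; it uses NumPy's Generator API instead of np.random.shuffle, so it will not reproduce 0.0406 exactly.

```python
import numpy as np

def permutation_p_value(group1, group2, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for a difference in means (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded generator so reruns give the same answer
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)
    observed = group1.mean() - group2.mean()
    pooled = np.concatenate([group1, group2])
    n1 = len(group1)
    perm_diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        # rng.permutation returns a shuffled copy, leaving pooled untouched
        shuffled = rng.permutation(pooled)
        perm_diffs[i] = shuffled[:n1].mean() - shuffled[n1:].mean()
    # Proportion of permuted differences at least as extreme as the observed one
    return np.mean(np.abs(perm_diffs) >= np.abs(observed))

# Toy data (made up): two clearly separated groups
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [11.0, 12.0, 13.0, 14.0, 15.0]
p = permutation_p_value(a, b, n_permutations=2000)
print(p)
```

On the real data this would be called with the Outcome values of each activity group, for example the two filtered columns used to compute the means earlier.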