My Process

First, I used pandas to read the data from the csv file. I called data.head() out of habit from one of my R courses; this is the Python version of the same function. head() lets us peek at the first few rows of the data so that we know what we’re working with. I know that data is often shared on GitHub, so it’s important that students are identified by anonymized (but unique, and I suppose encrypted) IDs so that their privacy is protected. I then labeled the columns for better data organization.

import pandas as pd

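# Load data from the CSV file.
# Note: the preview below shows the first record (5410951, activity2, 53)
# being used as the header row, so it is not counted as data; passing
# header=None to read_csv would keep it.
file_path = '/Users/vivianhhuynh/Downloads/pset5/learningOutcomes.csv'
data = pd.read_csv(file_path)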

# Display the first few rows of the data to peek at the csv
data.head()
##    5410951  activity2   53
## 0  5410952  activity2   79
## 1  5410953  activity1  207
## 2  5410954  activity1  179
## 3  5410955  activity2   68
## 4  5410956  activity1  135

# Rename columns for readability
data.columns = ['ID', 'Activity', 'Outcome']
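
The preview above suggests the file has no header row, since the first record shows up where column names would normally be. A minimal sketch of one way to keep that record, assuming the file is comma-separated and headerless (an inference from the preview, not something confirmed by the file itself): an in-memory buffer holding the six visible records stands in for the real path, and the name `preview` is only for illustration.

```python
import io
import pandas as pd

# The six records visible in the preview, written as a headerless CSV.
# Assumption: the real file is comma-separated with no header row.
raw = io.StringIO(
    "5410951,activity2,53\n"
    "5410952,activity2,79\n"
    "5410953,activity1,207\n"
    "5410954,activity1,179\n"
    "5410955,activity2,68\n"
    "5410956,activity1,135\n"
)

# header=None tells pandas that the first line is data, not column names;
# names= assigns the labels in the same step.
preview = pd.read_csv(raw, header=None, names=['ID', 'Activity', 'Outcome'])

print(preview.shape)
print(preview.head())
```

With this approach the rename step and the load step collapse into one call, and the first student stays in the sample.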

Next, we wanted to find the observed difference in sample means of the learning outcomes between students given activity 1 (group A) and students given activity 2 (group B). Since we labeled the columns, we can now use the strings in the Activity column to select the students who were given activity1 or activity2. To calculate the observed difference, I used the mean() function to find the mean of group A and of group B, then subtracted the mean of group B from the mean of group A. We see below that the difference is -8.398, which means that on average, students in group B (given activity 2) scored about 8.398 points higher on their learning outcomes than students in group A (given activity 1). The difference is negative only because we subtract group B’s larger mean from group A’s. On its own, it does not show that activity 2 is more effective; that is what the p-value in part b) helps us judge.

# Calculate the mean outcomes for activity 1 (group A) and activity 2 (group B)
mean_activity1 = data[data['Activity'] == 'activity1']['Outcome'].mean()
mean_activity2 = data[data['Activity'] == 'activity2']['Outcome'].mean()

# Calculate the observed difference in means
observed_difference = mean_activity1 - mean_activity2

observed_difference
## -8.398085385568976
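
The same calculation can also be written with groupby, which computes every group mean in one pass. This is only a sketch on hypothetical toy numbers (not the real learning outcomes), and the names `toy`, `group_means`, and `toy_difference` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from learningOutcomes.csv
toy = pd.DataFrame({
    'Activity': ['activity1', 'activity1', 'activity2', 'activity2'],
    'Outcome': [10, 20, 30, 40],
})

# One mean per activity, indexed by the activity label
group_means = toy.groupby('Activity')['Outcome'].mean()

# Same sign convention as above: group A (activity1) minus group B (activity2)
toy_difference = group_means['activity1'] - group_means['activity2']

print(group_means)
print(toy_difference)
```

Here the activity1 mean is 15 and the activity2 mean is 35, so the difference is negative for the same reason as in the real data: the second group's mean is larger.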

For part b), we wanted to calculate the p-value for the observed difference of means. That is, we wanted to know the probability of sampling two groups of students with a difference of means as extreme as, or more extreme than, the one calculated from our data, assuming the activity makes no difference. To do this, I used a permutation test. Under the null hypothesis, the learning outcomes for students in group A and students in group B are identically distributed, so the group labels are interchangeable. A permutation test doesn’t need to assume a specific parametric form for that distribution, so this method is useful and valid here since we have the raw data (the csv).

I combined the learning outcomes from both groups into a single array and then randomly shuffled and split it into two new groups, with the same sizes as the original groups, 10,000 times, each time calculating the difference in means between the two groups. The p-value was computed as the proportion of permutations in which the absolute difference in means was greater than or equal to the absolute value of the observed difference.

This p-value estimates the probability of observing a difference in means at least as extreme as the observed one under the null hypothesis that there is no difference between the two activities.
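
To make "the proportion of permutations at least as extreme" concrete, here is a tiny worked example that enumerates every possible relabeling instead of sampling 10,000 random ones. This is the exhaustive variant of the idea, not the method used for the real data, and the four toy outcomes and all names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy outcomes, not taken from learningOutcomes.csv
group_a = [1, 2]
group_b = [3, 4]

pooled = group_a + group_b
n_a = len(group_a)


def mean(values):
    return sum(values) / len(values)


observed = mean(group_a) - mean(group_b)

# Enumerate every way to choose which pooled positions form "group A"
diffs = []
for idx in combinations(range(len(pooled)), n_a):
    chosen = [pooled[i] for i in idx]
    rest = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diffs.append(mean(chosen) - mean(rest))

# Two-sided p-value: share of relabelings at least as extreme as observed
exact_p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)

print(observed, len(diffs), exact_p_value)
```

There are 6 ways to pick 2 of the 4 values, and only 2 of them (the original split and its mirror image) give an absolute difference of at least 2, so the exact p-value is 2/6. With the real sample sizes full enumeration is not feasible, which is why random shuffles are used instead.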

import numpy as np

# Set the number of permutations
n_permutations = 10000

# Combine the data into one array (copy it so the in-place shuffle below
# does not touch the DataFrame's own data)
combined_outcomes = data['Outcome'].values.copy()

# Observed difference in means
observed_diff = observed_difference

# Initialize an array to store the differences from permutations
perm_diffs = np.zeros(n_permutations)

# Number of students given activity 1 (the same in every permutation)
n_activity1 = (data['Activity'] == 'activity1').sum()

# Perform the permutation test
for i in range(n_permutations):
    # Shuffle the combined data in place
    np.random.shuffle(combined_outcomes)
    
    # Split the shuffled data into two groups of the original sizes
    perm_group1 = combined_outcomes[:n_activity1]
    perm_group2 = combined_outcomes[n_activity1:]
    
    # Calculate diff in means for ith permutation
    perm_diffs[i] = np.mean(perm_group1) - np.mean(perm_group2)

# Calculate p-value
p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))

p_value
## 0.0406
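
Because the shuffles are random, rerunning the code above will give a slightly different p-value each time. One way to make the result repeatable is to use a seeded NumPy random generator. The sketch below wraps the same procedure in a function; the function name `permutation_p_value`, the `seed` parameter, and the toy groups are assumptions for illustration, not part of the original analysis, and a seeded run will not reproduce the 0.0406 above.

```python
import numpy as np


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)  # seeded so reruns give the same answer
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)

    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    observed = group_a.mean() - group_b.mean()

    perm_diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        # rng.permutation returns a shuffled copy; pooled itself is unchanged
        shuffled = rng.permutation(pooled)
        perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

    # Share of permutations at least as extreme as the observed difference
    return np.mean(np.abs(perm_diffs) >= np.abs(observed))


# Hypothetical toy groups, not the real learning outcomes
p_same = permutation_p_value([1, 2, 3], [1, 2, 3])
p_far = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])

print(p_same, p_far)
```

Identical groups have an observed difference of 0, so every permutation counts as at least as extreme and the p-value is 1. Widely separated groups give a p-value close to 0.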