First, I loaded the peer grades from the CSV file into a pandas DataFrame. My code wasn’t working at first because I used the wrong column name (I assumed it was “Peer”, but it is actually “65”), so I added a sanity check that prints the column names. The name “65” most likely means the file has no header row, so pandas used the first grade as the column name.
import pandas as pd
# Load dataset
file_path = '/Users/vivianhhuynh/Downloads/pset5/peerGrades.csv'
peer_grades = pd.read_csv(file_path)
# Display the first few rows of the dataset to peek at the data
peer_grades.head()
## 65
## 0 68
## 1 80
## 2 100
## 3 73
## 4 82
# Sanity check to see what the column name was (Originally I thought it was 'Peer', but it's actually '65')
print(peer_grades.columns.tolist())
## ['65']
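A column named '65' is what pandas produces when a file has no header row and the first value is consumed as the header. Here is a minimal sketch of that behaviour and of the `header=None` alternative, assuming this is what happened. The column name `grade` and the inline sample data are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for peerGrades.csv: one grade per line, no header row
csv_text = "65\n68\n80\n100\n73\n82\n"

# Default behaviour: the first line is consumed as the column name
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps the first line as data; names= supplies a column label
with_header_none = pd.read_csv(io.StringIO(csv_text), header=None, names=["grade"])

print(with_default.columns.tolist())             # ['65']
print(len(with_default), len(with_header_none))  # 5 6
```

With the default settings the first grade is missing from the data, which has only a small effect on a large dataset but is worth knowing about.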
Not to be a tryhard, but I plotted the distribution of the peer grades as a histogram to see the shape of the data. Outliers, skewness, and similar patterns are much easier to spot visually than in the raw numbers, and they affect whether we should choose the mean or the median.
# Using matplotlib here to plot the data: enjoy!
import matplotlib.pyplot as plt
# Plot the distribution of peer grades
plt.figure(figsize=(10, 6))
plt.hist(peer_grades['65'], bins=30, edgecolor='black', alpha=0.7)
plt.title('Distribution of Peer Grades')
plt.xlabel('Grade')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
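The visual check can be backed up with numbers. This is a rough sketch, not part of the original analysis: it computes the sample skewness and flags outliers with Tukey's 1.5 × IQR rule. The `grades` values are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical grades standing in for the real column
grades = pd.Series([65, 68, 80, 100, 73, 82, 20, 85, 90, 78])

# Sample skewness: negative values indicate a longer left tail
skewness = grades.skew()

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = grades.quantile(0.25), grades.quantile(0.75)
iqr = q3 - q1
outliers = grades[(grades < q1 - 1.5 * iqr) | (grades > q3 + 1.5 * iqr)]

print(f"skewness: {skewness:.2f}")
print("flagged outliers:", outliers.tolist())  # [20]
```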
I defined a function, simulate_variance_of_mean_and_median(data, n_simulations=10000), that estimates by simulation the variance of both the mean and the median of 5 peer grades.
In each of the 10,000 simulations, the function randomly selects 5 grades from the dataset without replacement, calculates the mean and the median of that sample, and appends each to its own list. After all 10,000 simulations, it returns the variance of the collected means and the variance of the collected medians.
If students were given a final score based on the mean of the 5 grades given by their peers, the variance of that score, estimated from 10,000 simulated samples of 5 grades, would be approximately 100.7. Because the sampling is random and unseeded, the estimate changes slightly between runs; the run printed at the end of this section gave about 102.2.
If students were instead given a final score based on the median of the 5 grades, the variance estimated the same way would be approximately 57.4 (about 57.0 in the run printed at the end of this section).
The variance of the median is a little more than half that of the mean, which suggests that the median is a more stable and reliable measure for assigning final scores in this context.
I would therefore use the median of the 5 peer grades to assign scores. The median is less sensitive to outliers than the mean, which matters most when there are only a few peer grades: it reduces the risk of a single outlying grade disproportionately affecting a student’s final score.
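To make the outlier argument concrete, here is a small illustration with made-up grades (they are not from the dataset): four graders roughly agree and one gives an unusually harsh score.

```python
import numpy as np

# Hypothetical set of 5 peer grades with one unusually harsh grader
sample = np.array([80, 82, 85, 83, 20])

print(np.mean(sample))    # 70.0
print(np.median(sample))  # 82.0
```

The single low grade pulls the mean down by more than ten points, while the median stays with the four graders who agree.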
import numpy as np
# Function to simulate the experiment
def simulate_variance_of_mean_and_median(data, n_simulations=10000):
    # These lists hold the per-sample means and medians, not variances
    sample_means = []
    sample_medians = []
    for _ in range(n_simulations):
        # Draw 5 grades without replacement
        sample = np.random.choice(data, 5, replace=False)
        sample_means.append(np.mean(sample))
        sample_medians.append(np.median(sample))
    # Variance of each statistic across all simulated samples
    mean_variance = np.var(sample_means)
    median_variance = np.var(sample_medians)
    return mean_variance, median_variance
# Run the function that simulates experiment
mean_var, median_var = simulate_variance_of_mean_and_median(peer_grades['65'])
mean_var, median_var
## (102.1577500444, 56.96569679)
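Since the numbers shift from run to run, a seeded variant makes the result reproducible, and the simulated variance of the mean can be checked against the textbook value for sampling without replacement, (σ²/n)·(N−n)/(N−1). This is a sketch under assumptions, not the code used above: `simulate_seeded`, `analytic_variance_of_mean`, and `fake_grades` are hypothetical names, and the grades are randomly generated stand-ins for peer_grades['65'].

```python
import numpy as np

def simulate_seeded(data, n_simulations=10000, k=5, seed=0):
    """Variance of the sample mean and median over repeated draws of k grades."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    means = np.empty(n_simulations)
    medians = np.empty(n_simulations)
    for i in range(n_simulations):
        sample = rng.choice(data, size=k, replace=False)
        means[i] = sample.mean()
        medians[i] = np.median(sample)
    return means.var(), medians.var()

def analytic_variance_of_mean(data, k=5):
    """(sigma^2 / k) * (N - k) / (N - 1) for sampling without replacement."""
    data = np.asarray(data, dtype=float)
    n = data.size
    return data.var() / k * (n - k) / (n - 1)

# Hypothetical grades standing in for peer_grades['65']
fake_grades = np.random.default_rng(1).integers(40, 101, size=200)

first = simulate_seeded(fake_grades, seed=0)
second = simulate_seeded(fake_grades, seed=0)
print(first == second)  # same seed, same result
print(first[0], analytic_variance_of_mean(fake_grades))
```

There is no equally simple closed form for the variance of the median, which is one reason simulation is a convenient way to compare the two.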