Problem 1

Recall Simpson’s Paradox occurs when one group shows a higher result than another group, when all the data is aggregated, but it shows the opposite when the data is subdivided into different segments.

Here’s another example

Suppose we have two students, A and B, who took a test with 100 questions on it. Let’s say that we want to determine which of the two students performed better on the test.

The following data is given:

test_A=read.csv("test_A.csv",header=TRUE)
test_B=read.csv("test_B.csv", header = TRUE)

Each data frame, consists of two binary-values columns. Each row is a test question.

Correct: 1 means the student got the question correct. 0 means they were incorrect.
Difficult: 1 means the question was difficult. 0 means the question was easy.

a) (5 points)

Using R, calculate the overall percentage of correct answers for each student.

#one way
percentage_overall_correct_A <- (test_A %>% summarize(percentage = mean(Correct) * 100))$percentage 
percentage_overall_correct_B <- (test_B %>% summarize(percentage = mean(Correct) * 100))$percentage

#another way
percentage_overall_correct_A_method2 <- (test_A%>%summarise(percentage=sum(Correct)/n())*100)$percentage
percentage_correct_overall_B_method2 <- (test_B%>%summarise(percentage=sum(Correct)/n())*100)$percentage 
#logic is set so that you can see the result print on html page.

The overall percentage of correct answers for student A is 50%, while for student B, it is 80%.

Please see the .Rmd file to see how I was able to get the answers to render like this, inline in the text, or if you are only looking at the .Rmd please see the .html for the output.

b) (5 points)

Which student appears to have done better on the test, and why?

if (percentage_overall_correct_A>percentage_overall_correct_B)
{
  result='Student A'
}else{result='Student B'} #logic is set so that you can see the result print on html page.

Student B did better on the exam overall because they scored their overall percentage of correct answers was higher.

c) (10 points)

However, it turns out that the two tests were very different! How many questions were difficult on test A and how many were difficult on test B? Show your R work below and have it calculate these values and state clearly which answer is which.

difficult_A<-(test_A%>%summarise(N=sum(Difficult==1)))$N
difficult_B<-(test_B%>%summarise(N=sum(Difficult==1)))$N

There were 95 questions on exam A and 5 questions on exam B.

d) (10 points)

Calculate the percentage of correct answers for Student A on both the difficult and easy questions and store your two results in a data frame called Student_A in which one column is Difficult_Percent_Correct and the other is Easy_Percent_Correct.

# one way
Student_A_percentages_on_each<-test_A%>%summarise(Difficult_Percent_Correct=mean(Correct[Difficult==1]) *100,Easy_Percent_Correct=mean(Correct[Difficult==0]) *100)
rownames(Student_A_percentages_on_each)<-'Student A'
#another way
Student_A_percentages_on_each_method2<-test_A%>%summarise(Difficult_Percent_Correct=sum(Correct[Difficult==1])/sum(Difficult == 1) *100,Easy_Percent_Correct=sum(Correct[Difficult==0])/sum(Difficult == 0) *100)
rownames(Student_A_percentages_on_each_method2)<-'Student A'

Here is how Student A did:

e) (5 points)

Now, as you might have expected I would ask, calculate the percentage of correct answers for Student B on both the difficult and easy questions and store your two results in a data frame called Student_B in which one column is Difficult_Percent_Correct and the other is Easy_Percent_Correct.

Calculate the percentage of correct answers for Student A on both the difficult and easy questions and store your two results in a data frame called Student_A in which one column is Difficult_Percent_Correct and the other is Easy_Percent_Correct.

# one way
Student_B_percentages_on_each<-test_B%>%summarise(Difficult_Percent_Correct=mean(Correct[Difficult==1]) *100,Easy_Percent_Correct=mean(Correct[Difficult==0]) *100)
rownames(Student_B_percentages_on_each)<-'Student B'

#another way
Student_B_percentages_on_each_method2<-test_B%>%summarise(Difficult_Percent_Correct=sum(Correct[Difficult==1])/sum(Difficult == 1) *100,Easy_Percent_Correct=sum(Correct[Difficult==0])/sum(Difficult == 0) *100)
rownames(Student_B_percentages_on_each_method2)<-'Student B'

Here is how student B did:

f) (5 points)

Which student has a higher percentage of both the difficult and easy problems correct?

compare_students <- function(df_A ,  df_B) {
   # Compare percentages and create result strings
  difficult_result <- ifelse(df_A$Difficult_Percent_Correct > df_B$Difficult_Percent_Correct,
                             "Student A performed better on difficult questions.",
                             ifelse(df_A$Difficult_Percent_Correct < df_B$Difficult_Percent_Correct,
                                    "Student B performed better on difficult questions.",
                                    "Both students performed equally well on difficult questions."))
  
  easy_result <- ifelse(df_A$Easy_Percent_Correct > df_B$Easy_Percent_Correct,
                        "Student A performed better on easy questions.",
                        ifelse(df_A$Easy_Percent_Correct < df_B$Easy_Percent_Correct,
                               "Student B performed better on easy questions.",
                               "Both students performed equally well on easy questions."))
  
  both_result <- if (df_A$Difficult_Percent_Correct > df_B$Difficult_Percent_Correct &
                      df_A$Easy_Percent_Correct > df_B$Easy_Percent_Correct) {
    "Student A performed better than Student B on both difficult and easy questions."
  } else if (df_B$Difficult_Percent_Correct > df_A$Difficult_Percent_Correct &
               df_B$Easy_Percent_Correct > df_A$Easy_Percent_Correct) {
    "Student B performed better than Student A on both difficult and easy questions."
  } else {
    ""
  }
  
  # Return result strings as a list
  result_list <- list(diff=difficult_result, easy=easy_result, both=both_result)
  
  return(result_list)
}
  

best_on_each_type<-compare_students(Student_A_percentages_on_each ,  Student_B_percentages_on_each)

Student A performed better than Student B on both difficult and easy questions.

g) (10 points)

Explain why this problem is an example of Simpson’s Paradox.

This problem is an example of Simpson’s Paradox because when the data is aggregated overall, it appears that Student B performed better than Student A, with an overall percentage of correct answers of 80% compared to 50% for Student A. However, when the data is subdivided into different segments based on the difficulty level of the questions, it becomes apparent that Student A actually performed better than Student B on both difficult and easy questions.

Specifically, when we look at the breakdown of correct answers by difficulty level, we can see that Student A performed better than Student B on difficult questions (47.37% vs 40%) and also on easy questions (100% vs 82.11%). However, since Student B had a higher overall percentage of correct answers, it might appear at first glance that they performed better overall.

Simpson’s Paradox occurs when the relationship between two variables is reversed or eliminated when a third variable is introduced. In this case, the third variable is the difficulty level of the questions. When we look at the data as a whole, the difficulty level of the questions is not taken into account, and the overall percentage of correct answers can be misleading. However, when we look at the data broken down by difficulty level, we can see that the relationship between the two variables (i.e., which student performed better) is actually reversed

Problem 2. (50 points)

For this problem we are looking at the diamonds dataset. Suppose that the following are outliers: diamonds for which y <=3 or y>=20.

a) (25 points)

Find the average value of y excluding these outliers and have R output this result into the variable ave_y_no_outliers.

# Calculate average y value excluding outliers; (The complement of y <=3 or y>=20 is:  3<`y`<20; "AND"="Intersection"
ave_y_no_outliers <- diamonds %>%
  filter(y > 3 & y < 20) %>%
  summarise(mean_y = mean(y))

The average value of y excluding outliers is 5.7338012

b) (20 points)

Replace these outlier values of y with ave_y_no_outliers and call your new data frame diamonds_with_replaced_y

# Replace outlier values of y with ave_y_no_outliers
diamonds_with_replaced_y <- diamonds %>%
  mutate(y = ifelse(y <= 3 | y >= 20, ave_y_no_outliers$mean_y, y))

Here is a view of our new dataframe, diamonds_with_replaced_y

c) (5 points)

As a check please run summary(diamonds_with_replaced_y$y) and explain why this is a check.

summary(diamonds_with_replaced_y$y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.680   4.720   5.710   5.734   6.540  10.540

This command checks the summary statistics for the y column in the diamonds_with_replaced_y data frame. If the outlier values were properly replaced with ave_y_no_outliers, we should see that the minimum and maximum values of y are now greater than 3 and less than 20, respectively. We should also see that the mean value of y is close to the ave_y_no_outliers value calculated in part a. Therefore, running summary(diamonds_with_replaced_y$y) is a check to ensure that we have properly replaced the outlier values of y with the average value calculated in part a.

Answers to HW 6- Simpson’s Paradox and Imputing Outliers to be Average Values

Solutions. Please contact me if you have questions about any of this.