Intro and Set-Up

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(readxl)
library(knitr)

Namedata <- read_xlsx("~/Desktop/MTPracscores.xlsx", sheet = "student_info_practice")
Scoredata <- read_xlsx("~/Desktop/MTPracscores.xlsx", sheet = "student_scores_practice")

head(Namedata)
## # A tibble: 6 × 3
##   Name          Sex    `NYC County (Borough)`
##   <chr>         <chr>  <chr>                 
## 1 Lucas Nguyen  Male   Queenz                
## 2 Aisha Patel   Female Brooklyn              
## 3 Mateo Rivera  Male   Bronx                 
## 4 Naomi Kim     Female Manhattan             
## 5 Elijah Brooks Male   SI                    
## 6 Sofia Rossi   Female Queenzz
head(Scoredata)
## # A tibble: 6 × 5
##   `First Name` `Last Name` `Final Test Scores` `Practice Test Scores`
##   <chr>        <chr>                     <dbl>                  <dbl>
## 1 Lucas        Nguyen                       87                     94
## 2 Aisha        Patel                        64                     NA
## 3 Mateo        Rivera                       92                     83
## 4 Naomi        Kim                          78                     68
## 5 Elijah       Brooks                       55                     72
## 6 Sofia        Rossi                       100                    100
## # ℹ 1 more variable: `School Type` <chr>

Merging Data

Scoredata <- Scoredata %>%
  mutate(Name = paste(`First Name`, `Last Name`, sep = " "))
head(Scoredata$Name)
## [1] "Lucas Nguyen"  "Aisha Patel"   "Mateo Rivera"  "Naomi Kim"    
## [5] "Elijah Brooks" "Sofia Rossi"
Namecombined <- left_join(Namedata, Scoredata, by = "Name")

library(janitor) 

Namecombined <- clean_names(Namecombined)
head(Namecombined)
## # A tibble: 6 × 8
##   name          sex    nyc_county_borough first_name last_name final_test_scores
##   <chr>         <chr>  <chr>              <chr>      <chr>                 <dbl>
## 1 Lucas Nguyen  Male   Queenz             Lucas      Nguyen                   87
## 2 Aisha Patel   Female Brooklyn           Aisha      Patel                    64
## 3 Mateo Rivera  Male   Bronx              Mateo      Rivera                   92
## 4 Naomi Kim     Female Manhattan          Naomi      Kim                      78
## 5 Elijah Brooks Male   SI                 Elijah     Brooks                   55
## 6 Sofia Rossi   Female Queenzz            Sofia      Rossi                   100
## # ℹ 2 more variables: practice_test_scores <dbl>, school_type <chr>
student_data <- Namecombined
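
Because left_join() keeps every row of Namedata and silently fills NA where no score row matches, it can help to check for unmatched or duplicated names before trusting the merged table. The following is a minimal sketch on hypothetical toy data (the data frames names_toy and scores_toy and their values are invented for illustration, not taken from the spreadsheet):

```r
library(dplyr)

# Hypothetical toy tables standing in for Namedata and Scoredata
names_toy <- data.frame(
  Name = c("Lucas Nguyen", "Aisha Patel", "Mateo Rivera"),
  Sex  = c("Male", "Female", "Male")
)
scores_toy <- data.frame(
  Name  = c("Lucas Nguyen", "Aisha Patel", "Mateo  Rivera"),  # note the double space
  Final = c(87, 64, 92)
)

# Rows of names_toy that have no partner in scores_toy
unmatched <- anti_join(names_toy, scores_toy, by = "Name")

# Names that occur more than once would duplicate rows after the join
dup_names <- scores_toy %>% count(Name) %>% filter(n > 1)
```

If unmatched has zero rows and dup_names is empty, the join is one-to-one on Name; otherwise the offending names need cleaning first.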

Borough Cleaning

cleanedborodata <- student_data %>%
  mutate(nyc_county_borough = case_when(
    nyc_county_borough %in% c("BK", "Bk") ~ "Brooklyn",
    nyc_county_borough %in% c("Q", "Queens", "Queenz", "Queenzz") ~ "Queens",
    nyc_county_borough %in% c("BX", "Bx") ~ "Bronx",
    nyc_county_borough %in% c("SI") ~ "Staten Island",
    nyc_county_borough %in% c("M") ~ "Manhattan",
    TRUE ~ nyc_county_borough
  ))
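
One way to confirm that the recoding caught every variant is to compare the cleaned labels against the five official borough names. This is a sketch under assumptions: boro_toy is a hypothetical set of raw labels, and valid_boroughs is a helper vector introduced here for illustration.

```r
library(dplyr)

# Hypothetical raw labels, including variants like those seen in head(Namedata)
boro_toy <- data.frame(
  nyc_county_borough = c("Queenz", "Brooklyn", "BX", "SI", "M", "Queenzz", "Bk")
)

# Helper vector (an assumption of this sketch): the five expected labels
valid_boroughs <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")

# Same recoding rules as in the report, applied to the toy data
cleaned_toy <- boro_toy %>%
  mutate(nyc_county_borough = case_when(
    nyc_county_borough %in% c("BK", "Bk") ~ "Brooklyn",
    nyc_county_borough %in% c("Q", "Queens", "Queenz", "Queenzz") ~ "Queens",
    nyc_county_borough %in% c("BX", "Bx") ~ "Bronx",
    nyc_county_borough %in% c("SI") ~ "Staten Island",
    nyc_county_borough %in% c("M") ~ "Manhattan",
    TRUE ~ nyc_county_borough
  ))

# Any label left here slipped past the recoding rules
leftover <- setdiff(unique(cleaned_toy$nyc_county_borough), valid_boroughs)
```

An empty leftover means every label maps to a real borough; anything else is a variant that still needs a rule.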

Descriptive Stats

library(mosaic)
favstats(~ final_test_scores, data = cleanedborodata)
##  min    Q1 median    Q3 max mean       sd  n missing
##   48 65.25   77.5 87.75 100 76.4 13.98542 50       0

The mean of the final test scores is 76.4, the maximum is 100, the minimum is 48, and the standard deviation is 13.99.

Scatterplot

ggplot(cleanedborodata, aes(x = practice_test_scores, y = final_test_scores)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Relationship Between Practice and Final Test Scores",
    x = "Practice Test Scores",
    y = "Final Test Scores"
  ) +
  theme_minimal()

Based on this plot, practice test scores don’t appear to be particularly predictive of final test scores. There is quite a bit of unexplained variation.
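
A correlation coefficient would put a number on the visual impression. The sketch below uses only the six rows printed by head(Scoredata), so it illustrates the calculation rather than the result for the full sample; the variable names are introduced here for illustration.

```r
# First six rows from head(Scoredata); the NA is the missing practice score
practice <- c(94, NA, 83, 68, 72, 100)
final    <- c(87, 64, 92, 78, 55, 100)

# Pearson correlation using only rows where both scores are present
r <- cor(practice, final, use = "complete.obs")

# Share of variance in final scores that a straight-line fit accounts for
r_squared <- r^2
```

On the full data, the same call with cleanedborodata$practice_test_scores and cleanedborodata$final_test_scores would give the value that matches the plotted regression line.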

T-Test

t.test(final_test_scores ~ sex, data = cleanedborodata)
## 
##  Welch Two Sample t-test
## 
## data:  final_test_scores by sex
## t = 1.1151, df = 47.92, p-value = 0.2704
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -3.534074 12.334074
## sample estimates:
## mean in group Female   mean in group Male 
##                 78.6                 74.2

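On average, the females scored higher on the final test (78.6 vs. 74.2, a difference of 4.4 points). However, the difference is not statistically significant, t(47.92) = 1.12, p = .27, and the 95% confidence interval (-3.53 to 12.33) includes zero.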
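
A non-significant result from 50 students may also reflect low power. The sketch below estimates power for a gap of this size with power.t.test() from base R. Two inputs are assumptions, not values from the report: 25 students per group, and the overall standard deviation of 13.99 used as a stand-in for the within-group standard deviation.

```r
# Assumptions (not from the report): 25 students per group, and the overall
# SD of 13.99 used as a stand-in for the within-group SD
observed_gap <- 78.6 - 74.2   # difference in group means from the t-test output

# Approximate power of a two-sample t-test to detect a gap of this size
pw <- power.t.test(n = 25, delta = observed_gap, sd = 13.99, sig.level = 0.05)

# Per-group sample size needed for 80% power at the same gap
needed <- power.t.test(delta = observed_gap, sd = 13.99, power = 0.80)
```

pw$power holds the estimated power and needed$n the per-group sample size; both depend directly on the assumed group size and standard deviation.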

ANOVA Scores

anova_school <- aov(final_test_scores ~ school_type, data = student_data)
summary(anova_school)
##             Df Sum Sq Mean Sq F value Pr(>F)
## school_type  2    237   118.3   0.595  0.556
## Residuals   47   9347   198.9
anova_borough <- aov(final_test_scores ~ nyc_county_borough, data = student_data)
summary(anova_borough)
##                    Df Sum Sq Mean Sq F value Pr(>F)  
## nyc_county_borough 13   3704   284.9   1.744 0.0932 .
## Residuals          36   5880   163.3                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

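The borough model accounts for more variance than the school-type model, but neither effect is statistically significant. There is no significant effect of school type on final test scores, F(2, 47) = 0.60, p = .56, and the effect of borough is marginal, F(13, 36) = 1.74, p = .09. About 2.5% of the variance is explained by school type (237/9,584), whereas about 38.6% is explained by borough (3,704/9,584). Note, however, that both models were fit to student_data rather than cleanedborodata: the 13 degrees of freedom for borough imply 14 distinct labels, so misspellings such as “Queenz” were treated as separate boroughs, which can inflate the variance explained. Overall, the effect of borough is worth further consideration, given the amount of variance it explains and its marginal p-value, but this study does not confirm its significance, and the borough model should be re-run on the cleaned data.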
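
The variance-explained figures can be recomputed directly from the printed ANOVA tables. The sketch below copies the sums of squares as printed (so they are rounded) and recomputes eta squared and the borough F statistic; the variable names are introduced here for illustration.

```r
# Sums of squares copied from the two summary() tables (rounded as printed)
ss_school  <- 237
ss_resid_school <- 9347
ss_borough <- 3704
ss_resid_borough <- 5880

# Eta squared = SS_effect / SS_total
eta_sq_school  <- ss_school  / (ss_school  + ss_resid_school)
eta_sq_borough <- ss_borough / (ss_borough + ss_resid_borough)

# F = (SS_effect / df_effect) / (SS_resid / df_resid), recomputed for borough
f_borough <- (ss_borough / 13) / (ss_resid_borough / 36)

# Upper-tail p-value for that F with 13 and 36 degrees of freedom
p_borough <- pf(f_borough, df1 = 13, df2 = 36, lower.tail = FALSE)
```

Because the inputs are rounded, the recomputed values can differ from R's own output in the last digit.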

Reflection

Overall, I didn’t find writing the code all that challenging, since I have many snippets from past assignments. The more I use R (particularly R Markdown), the easier it is for me to troubleshoot issues. One logistical piece that was a bit confusing was configuring the setup chunk so that it displayed the code without displaying the messages and warnings, and resolving the “zero length variable” error.

For me, the most challenging part of this assignment was interpreting the actual data. While I found it very easy to find and report the p-values and the PRE, I wasn’t always sure how these statistics should be related to real-world situations. For instance, I wasn’t sure if .09 should be considered a “bad p-value” since it was greater than .05, or a notable one since (by most metrics) this experiment would be considered statistically under-powered. The fact that borough explained about 39% of the variance confused me further. This seemed high enough to warrant mention (particularly in social science), but not necessarily “appropriate” to mention in conjunction with a marginal p-value. Taken together, the only conclusion I could reach was that borough should be studied more, but that the evidence was inconclusive.

Often I find that when I think I’m confused about some mechanism of coding, I’m actually confused about a statistical concept. When I have a clear idea of what I want to find, I can almost always find the code I need somewhere. So I would like to focus most on improving my fundamental understanding of certain statistical ideas, such as power and effect size, before the midterm.