knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(readxl)
library(knitr)
Namedata <- read_xlsx("~/Desktop/MTPracscores.xlsx", sheet = "student_info_practice")
Scoredata <- read_xlsx("~/Desktop/MTPracscores.xlsx", sheet = "student_scores_practice")
head(Namedata)
## # A tibble: 6 × 3
## Name Sex `NYC County (Borough)`
## <chr> <chr> <chr>
## 1 Lucas Nguyen Male Queenz
## 2 Aisha Patel Female Brooklyn
## 3 Mateo Rivera Male Bronx
## 4 Naomi Kim Female Manhattan
## 5 Elijah Brooks Male SI
## 6 Sofia Rossi Female Queenzz
head(Scoredata)
## # A tibble: 6 × 5
## `First Name` `Last Name` `Final Test Scores` `Practice Test Scores`
## <chr> <chr> <dbl> <dbl>
## 1 Lucas Nguyen 87 94
## 2 Aisha Patel 64 NA
## 3 Mateo Rivera 92 83
## 4 Naomi Kim 78 68
## 5 Elijah Brooks 55 72
## 6 Sofia Rossi 100 100
## # ℹ 1 more variable: `School Type` <chr>
Scoredata <- Scoredata %>%
  mutate(Name = paste(`First Name`, `Last Name`, sep = " "))
head(Scoredata$Name)
## [1] "Lucas Nguyen" "Aisha Patel" "Mateo Rivera" "Naomi Kim"
## [5] "Elijah Brooks" "Sofia Rossi"
Namecombined <- left_join(Namedata, Scoredata, by = "Name")
library(janitor)
Namecombined <- clean_names(Namecombined)
head(Namecombined)
## # A tibble: 6 × 8
## name sex nyc_county_borough first_name last_name final_test_scores
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Lucas Nguyen Male Queenz Lucas Nguyen 87
## 2 Aisha Patel Female Brooklyn Aisha Patel 64
## 3 Mateo Rivera Male Bronx Mateo Rivera 92
## 4 Naomi Kim Female Manhattan Naomi Kim 78
## 5 Elijah Brooks Male SI Elijah Brooks 55
## 6 Sofia Rossi Female Queenzz Sofia Rossi 100
## # ℹ 2 more variables: practice_test_scores <dbl>, school_type <chr>
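A join on a pasted name column drops scores silently when a name differs by even one character, so it can help to list unmatched names before trusting the combined table. The sketch below uses two small hypothetical data frames (`names_tbl` and `scores_tbl` are stand-ins, not the real sheets) and assumes only that `dplyr` is installed.

```r
library(dplyr)

# Hypothetical stand-ins for Namedata and Scoredata
names_tbl <- data.frame(
  Name = c("Lucas Nguyen", "Aisha Patel", "Mateo Rivera"),
  Sex  = c("Male", "Female", "Male")
)
scores_tbl <- data.frame(
  Name  = c("Lucas Nguyen", "Aisha Patel", "Mateo  Rivera"),  # double space on purpose
  Final = c(87, 64, 92)
)

# Rows of names_tbl whose Name has no exact match in scores_tbl
unmatched <- anti_join(names_tbl, scores_tbl, by = "Name")
unmatched$Name

# After a left join, those same rows carry NA in the score columns
joined <- left_join(names_tbl, scores_tbl, by = "Name")
sum(is.na(joined$Final))
```

If `unmatched` has zero rows on the real data, every student in the info sheet found a score record.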
student_data <- Namecombined
cleanedborodata <- student_data %>%
  mutate(nyc_county_borough = case_when(
    nyc_county_borough %in% c("BK", "Bk") ~ "Brooklyn",
    nyc_county_borough %in% c("Q", "Queens", "Queenz", "Queenzz") ~ "Queens",
    nyc_county_borough %in% c("BX", "Bx") ~ "Bronx",
    nyc_county_borough %in% c("SI") ~ "Staten Island",
    nyc_county_borough %in% c("M") ~ "Manhattan",
    TRUE ~ nyc_county_borough
  ))
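Because the final `TRUE ~ nyc_county_borough` branch passes unknown labels through unchanged, a quick check for leftovers can confirm that the recode caught every variant. This is a minimal sketch on hypothetical labels (the lowercase "bronx" is invented to show a miss); the vector `boroughs` is an assumed list of the five target names.

```r
library(dplyr)

# Hypothetical labels; "bronx" is deliberately not covered by the rules below
toy <- data.frame(
  nyc_county_borough = c("Queenz", "Queenzz", "SI", "Brooklyn", "bronx")
)

cleaned <- toy %>%
  mutate(nyc_county_borough = case_when(
    nyc_county_borough %in% c("Queenz", "Queenzz") ~ "Queens",
    nyc_county_borough %in% c("SI") ~ "Staten Island",
    TRUE ~ nyc_county_borough
  ))

# The five labels the cleaned column is expected to contain
boroughs <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")

# Any label listed here slipped past the recode rules
leftover <- setdiff(unique(cleaned$nyc_county_borough), boroughs)
leftover
```

The same check is worth running before any model that uses the borough column: a borough variable with more than five distinct values signals labels that were not recoded.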
library(mosaic)
favstats(~ final_test_scores, data = cleanedborodata)
## min Q1 median Q3 max mean sd n missing
## 48 65.25 77.5 87.75 100 76.4 13.98542 50 0
The final test scores have a mean of 76.4 and a standard deviation of 13.99, and they range from a minimum of 48 to a maximum of 100 (n = 50, with no missing values).
ggplot(cleanedborodata, aes(x = practice_test_scores, y = final_test_scores)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Relationship Between Practice and Final Test Scores",
    x = "Practice Test Scores",
    y = "Final Test Scores"
  ) +
  theme_minimal()
Based on this plot, practice test scores do not appear to be strongly predictive of final test scores. The fitted line leaves quite a bit of unexplained variation.
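A visual impression like this can be backed with a number: the correlation between the two scores, or equivalently the R² of the simple regression. The sketch below uses only the six rows printed by `head(Scoredata)` earlier, so whatever it returns describes those six students and not the full sample of 50; the variable names are illustrative.

```r
# The six rows shown in the head(Scoredata) output; one practice score is NA
practice <- c(94, NA, 83, 68, 72, 100)
final    <- c(87, 64, 92, 78, 55, 100)

# Pearson correlation, dropping the pair with the missing practice score
r <- cor(practice, final, use = "complete.obs")

# lm() drops incomplete rows by default, so it uses the same five pairs
fit <- lm(final ~ practice)
r_squared <- summary(fit)$r.squared

r
r_squared  # equals r^2 for a one-predictor regression
```

On the real data, the same two calls with `cleanedborodata$practice_test_scores` and `cleanedborodata$final_test_scores` would give the share of final-score variance the practice test accounts for.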
t.test(final_test_scores ~ sex, data = cleanedborodata)
##
## Welch Two Sample t-test
##
## data: final_test_scores by sex
## t = 1.1151, df = 47.92, p-value = 0.2704
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -3.534074 12.334074
## sample estimates:
## mean in group Female mean in group Male
## 78.6 74.2
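On average, the female students scored higher on the final test (78.6 vs. 74.2), but the difference is not statistically significant, t(47.92) = 1.12, p = .27, and the 95% confidence interval for the difference (-3.53 to 12.33) includes zero.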
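A p-value says nothing about how large the gap is, so an effect size such as Cohen's d is a useful companion to the t-test. The sketch below computes d by hand with a pooled standard deviation, using only the six students printed in the `head()` outputs above (three female, three male); it illustrates the formula and is not an estimate for the full sample.

```r
# Final test scores for the six students shown in the head() outputs
female <- c(64, 78, 100)  # Aisha Patel, Naomi Kim, Sofia Rossi
male   <- c(87, 92, 55)   # Lucas Nguyen, Mateo Rivera, Elijah Brooks

n1 <- length(female)
n2 <- length(male)

# Pooled standard deviation: variances weighted by their degrees of freedom
pooled_sd <- sqrt(((n1 - 1) * var(female) + (n2 - 1) * var(male)) / (n1 + n2 - 2))

# Cohen's d: mean difference expressed in pooled-SD units
cohens_d <- (mean(female) - mean(male)) / pooled_sd
cohens_d
```

By a common rule of thumb, values of d near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.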
anova_school <- aov(final_test_scores ~ school_type, data = student_data)
summary(anova_school)
## Df Sum Sq Mean Sq F value Pr(>F)
## school_type 2 237 118.3 0.595 0.556
## Residuals 47 9347 198.9
anova_borough <- aov(final_test_scores ~ nyc_county_borough, data = student_data)
summary(anova_borough)
## Df Sum Sq Mean Sq F value Pr(>F)
## nyc_county_borough 13 3704 284.9 1.744 0.0932 .
## Residuals 36 5880 163.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Borough appears to matter more than school type, but neither effect is statistically significant. There is no significant effect of school type on final test scores, F(2, 47) = 0.60, p = .56, and the effect of borough is marginal, F(13, 36) = 1.74, p = .09. About 2.5% of the variance is explained by school type (237/9,584), whereas about 39% is explained by borough (3,704/9,584). One caveat: the borough model was fit on student_data rather than cleanedborodata, so its 13 degrees of freedom reflect 14 uncleaned borough labels (such as “Queenz” and “Queenzz”) rather than the five boroughs, and splitting boroughs into extra groups can only increase the variance explained. Overall, the effect of borough is worth further consideration, given the amount of variance it explains and its marginal p-value, but this study does not confirm its significance.
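The variance-explained figures are eta squared (the PRE): the effect sum of squares divided by the total sum of squares. As a small worked check, the sketch below recomputes both values from the rounded sums of squares printed in the two ANOVA tables; the data frame `ss` and its column names are illustrative, and results from unrounded sums could differ slightly in the last digit.

```r
# Sums of squares copied from the two ANOVA tables (already rounded in the printout)
ss <- data.frame(
  term     = c("school_type", "nyc_county_borough"),
  effect   = c(237, 3704),
  residual = c(9347, 5880)
)

# Eta squared = SS_effect / (SS_effect + SS_residual)
ss$eta_sq <- ss$effect / (ss$effect + ss$residual)
round(ss$eta_sq, 3)
# 0.025 0.386
```

Both models share the same total sum of squares (9,584), which is expected because they describe the same 50 final test scores.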
Overall, I didn’t find writing the code all that challenging, since I have many snippets from past assignments. The more I use R (particularly R Markdown), the easier it is for me to troubleshoot issues. One logistical piece that was a bit confusing was configuring the R setup chunk so that it displayed the code without displaying messages and warnings, and resolving the “zero-length variable name” error.
For me, the most challenging part of this assignment was interpreting the actual data. While I found it very easy to find and report the p-values and the PRE, I wasn’t always sure how these statistics should be related to real-world situations. For instance, I wasn’t sure whether .09 should be considered a “bad p-value” because it is greater than .05, or a notable one because (by most metrics) this experiment would be considered statistically underpowered. The fact that borough explained about 39% of the variance confused me further. This seemed high enough to warrant mention (particularly in social science), but not necessarily “appropriate” to mention in conjunction with a marginal p-value. Taken together, the only conclusion I could reach was that borough should be studied more, but that the evidence was inconclusive.
Often I find that when I think I’m confused about some mechanism of coding, I’m actually confused about a statistical concept. When I have a clear idea of what I want to see or find, I can almost always find the code I need somewhere. So before the midterm, I would like to focus most on improving my fundamental understanding of certain statistical ideas, such as power and effect size.