The high school data I am analyzing was compiled and published by the New York City Department of Education, with SAT score averages and testing rates provided by the College Board. This data is crucial for addressing my research question: “Is there a relationship between average sat scores in maths,reading,writing,school name and the city
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(treemap)
library(RColorBrewer)
library(ggfortify)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
setwd("C:/Users/eyong/OneDrive - montgomerycollege.edu/Desktop/data110/New folder")
df<-read_csv("scores.csv")
## Rows: 435 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): School ID, School Name, Borough, Building Code, Street Address, C...
## dbl (7): Zip Code, Latitude, Longitude, Student Enrollment, Average Score ...
## time (2): Start Time, End Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(df)
## School ID School Name Borough Building Code
## Length:435 Length:435 Length:435 Length:435
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Street Address City State Zip Code
## Length:435 Length:435 Length:435 Min. :10001
## Class :character Class :character Class :character 1st Qu.:10302
## Mode :character Mode :character Mode :character Median :10472
## Mean :10726
## 3rd Qu.:11217
## Max. :11694
##
## Latitude Longitude Phone Number Start Time
## Min. :40.53 Min. :-74.19 Length:435 Length:435
## 1st Qu.:40.69 1st Qu.:-73.98 Class :character Class1:hms
## Median :40.74 Median :-73.93 Mode :character Class2:difftime
## Mean :40.75 Mean :-73.92 Mode :numeric
## 3rd Qu.:40.82 3rd Qu.:-73.88
## Max. :40.89 Max. :-73.73
##
## End Time Student Enrollment Percent White Percent Black
## Length:435 Min. : 62.0 Length:435 Length:435
## Class1:hms 1st Qu.: 357.2 Class :character Class :character
## Class2:difftime Median : 466.0 Mode :character Mode :character
## Mode :numeric Mean : 702.4
## 3rd Qu.: 623.2
## Max. :5447.0
## NA's :7
## Percent Hispanic Percent Asian Average Score (SAT Math)
## Length:435 Length:435 Min. :317.0
## Class :character Class :character 1st Qu.:386.0
## Mode :character Mode :character Median :415.0
## Mean :432.9
## 3rd Qu.:458.5
## Max. :754.0
## NA's :60
## Average Score (SAT Reading) Average Score (SAT Writing) Percent Tested
## Min. :302.0 Min. :284.0 Length:435
## 1st Qu.:386.0 1st Qu.:382.0 Class :character
## Median :413.0 Median :403.0 Mode :character
## Mean :424.5 Mean :418.5
## 3rd Qu.:445.0 3rd Qu.:437.5
## Max. :697.0 Max. :693.0
## NA's :60 NA's :60
gf<-janitor::clean_names(df)
cleaned_data <- gf %>%
filter(if_all(c("average_score_sat_math","average_score_sat_reading","percent_tested","city","school_name"), ~ !is.na(.)))
print(cleaned_data)
## # A tibble: 375 × 22
## school_id school_name borough building_code street_address city state
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 01M539 New Explorations … Manhat… M022 111 Columbia … Manh… NY
## 2 02M294 Essex Street Acad… Manhat… M445 350 Grand Str… Manh… NY
## 3 02M308 Lower Manhattan A… Manhat… M445 350 Grand Str… Manh… NY
## 4 02M545 High School for D… Manhat… M445 350 Grand Str… Manh… NY
## 5 01M292 Henry Street Scho… Manhat… M056 220 Henry Str… Manh… NY
## 6 01M696 Bard High School … Manhat… M097 525 East Hous… Manh… NY
## 7 02M305 Urban Assembly Ac… Manhat… M445 350 Grand Str… Manh… NY
## 8 01M509 Marta Valle High … Manhat… M025 145 Stanton S… Manh… NY
## 9 01M448 University Neighb… Manhat… M446 200 Monroe St… Manh… NY
## 10 02M543 New Design High S… Manhat… M445 350 Grand Str… Manh… NY
## # ℹ 365 more rows
## # ℹ 15 more variables: zip_code <dbl>, latitude <dbl>, longitude <dbl>,
## # phone_number <chr>, start_time <time>, end_time <time>,
## # student_enrollment <dbl>, percent_white <chr>, percent_black <chr>,
## # percent_hispanic <chr>, percent_asian <chr>, average_score_sat_math <dbl>,
## # average_score_sat_reading <dbl>, average_score_sat_writing <dbl>,
## # percent_tested <chr>
score <- cleaned_data[, c("average_score_sat_math","average_score_sat_reading","city","average_score_sat_writing")]
correlation <- cor(score$average_score_sat_math, score$average_score_sat_reading)
print(paste("Correlation between average SAT Math and Reading scores:", round(correlation, 4)))
## [1] "Correlation between average SAT Math and Reading scores: 0.9282"
ggplot(score, aes(x = average_score_sat_math, y = average_score_sat_reading)) +
geom_point(aes(color = city), alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Correlation between Average SAT Math and Reading Scores",
x = "Average SAT Math Score",
y = "Average SAT Reading Score") +
theme_minimal() +
theme(legend.position = "right")
## `geom_smooth()` using formula = 'y ~ x'
The correlation value of 0.9282 indicates a strong relationship between the average SAT math and reading scores, as it is very close to one. This suggests that as scores in one subject increase, scores in the other tend to increase as well, highlighting the interdependence of these two academic areas.
score_long <- pivot_longer(score,
cols = c("average_score_sat_math", "average_score_sat_reading","average_score_sat_writing"),
names_to = "Score_Type",
values_to = "Score_Value")
library(RColorBrewer)
ggplot(score_long, aes(x = city, y = Score_Type, fill = Score_Value)) +
geom_tile(color = "white") +
scale_fill_gradientn(colors = brewer.pal(11, "RdYlBu")) + # Use RColorBrewer
labs(title = "Heatmap of SAT Scores by City",
x = "City",
y = "Score Type",
fill = "Score Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The heatmap illustrates that Manhattan has high SAT scores in math, writing, and reading, suggesting that this city may be an ideal choice for a child aspiring to succeed in school.
score$city <- as.factor(score$city)
model <- lm(average_score_sat_writing ~ average_score_sat_math + average_score_sat_reading, data = score)
summary(model)
##
## Call:
## lm(formula = average_score_sat_writing ~ average_score_sat_math +
## average_score_sat_reading, data = score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.108 -6.571 0.207 6.614 31.714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.71851 3.78848 -3.885 0.000121 ***
## average_score_sat_math 0.12598 0.02023 6.227 1.29e-09 ***
## average_score_sat_reading 0.89194 0.02352 37.915 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.47 on 372 degrees of freedom
## Multiple R-squared: 0.9738, Adjusted R-squared: 0.9737
## F-statistic: 6918 on 2 and 372 DF, p-value: < 2.2e-16
This regression analysis seeks to determine whether the average SAT writing score is influenced by the average SAT scores in math and reading. The R-squared value is 0.9737, indicating a strong correlation closer to 1, which suggests a high significance in the relationship. Therefore, we can conclude that writing scores are indeed affected by both math and reading scores.
par(mfrow=c(2,2))
plot(model)
The correlation value of 0.9282 indicates a strong relationship between the average SAT math and reading scores, suggesting that as scores in one subject increase, scores in the other tend to rise as well, highlighting the interdependence of these academic areas. The heatmap reveals that Manhattan shows high SAT scores in math, writing, and reading, indicating it may be an ideal city for a child aspiring to succeed in school. Additionally, the regression analysis aims to determine whether the average SAT writing score is influenced by the average SAT scores in math and reading. With an R-squared value of 0.9737, this indicates a strong correlation, suggesting that writing scores are significantly affected by both math and reading scores.
In my analysis, I used the janitor library to clean the dataset, which involved fixing column names that contained spaces and capital letters. This process was necessary to make the dataset more manageable, as it was a bit of a headache initially. Once I resolved this issue, cleaning the data by removing NA values from relevant columns was straightforward.
published by the New York City Department of Education
this is the link to kaggle Average SAT Scores for NYC Public Schools (kaggle.com)