My dataset covers the SAT scores and GPAs of high school students, along with their first-year college GPA. The variables are the student’s sex, SAT verbal score as a percentile, SAT math score as a percentile, the sum of these two scores as a percentile, high school GPA, and first-year college GPA. I’m going to look at the correlation between a student’s total SAT score and their first-year college GPA.
#Load necessary packages, read csv file
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#ggplot2 is already attached by tidyverse; ggpubr provides stat_cor()
library(ggpubr)
satgpa <- read_csv("satgpa.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): sex, sat_v, sat_m, sat_sum, hs_gpa, fy_gpa
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Rewrite 1 and 2 as Male and Female respectively
satgpa <- mutate(satgpa, sex = factor(sex, levels = c(1,2), labels = c("Male", "Female")))
#Create scatterplot and correlate variables
ggplot(satgpa, aes(x = sat_sum, y = fy_gpa)) +
  #Map color inside geom_point() so stat_cor() reports one overall r instead of one per sex
  geom_point(aes(color = sex), size = 1.5, shape = 16) +
  scale_color_manual(values = c("Male" = "#62c45e", "Female" = "#0a26ad")) +
  stat_cor(method = "pearson") +
  labs(
    title = "First Year GPA vs. SAT Total",
    x = "SAT Total (Percentile)",
    y = "First Year College GPA",
    caption = "Source: Educational Testing Service"
  ) +
  theme_dark()
Using mutate(), I recoded the values 1 and 2 as Male and Female so the data would be easier to read and plot. The scatter plot shows a moderate correlation of about 0.48. Since SAT scores are commonly cited as strong predictors of college success, this finding wasn’t too surprising to me. Even so, I suspect that high school GPA may correlate more strongly with first-year college GPA than SAT score does, but I wasn’t sure how to include that in my visualization. I was also hoping to show the r squared value in the plot, but I wasn’t able to do so.
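One possible way to follow up on both points is sketched below. It has not been checked against the real data, so treat it as a starting point only. The helper names cor_summary() and plot_with_r2() are made up for this illustration and are not part of any package. The sketch assumes the same satgpa.csv file and column names as above, and that ggpubr's stat_cor() exposes the computed labels r.label and rr.label, as described in its documentation.

```r
# Sketch only: cor_summary() and plot_with_r2() are hypothetical helpers.
library(ggplot2)
library(ggpubr)

# Pearson r and r squared between each predictor column and the outcome column
cor_summary <- function(data, predictors, outcome) {
  r <- sapply(predictors, function(p) {
    cor(data[[p]], data[[outcome]], use = "complete.obs")
  })
  data.frame(predictor = predictors, r = unname(r), r_squared = unname(r)^2)
}

# Scatter plot whose label shows both r and r squared for all students together.
# Color is mapped only in geom_point(), so stat_cor() sees a single group.
plot_with_r2 <- function(data) {
  ggplot(data, aes(x = sat_sum, y = fy_gpa)) +
    geom_point(aes(color = sex), size = 1.5, shape = 16) +
    stat_cor(
      aes(label = paste(after_stat(r.label), after_stat(rr.label), sep = "~`,`~")),
      method = "pearson"
    )
}

# Only runs when the data file from this report is in the working directory
if (file.exists("satgpa.csv")) {
  satgpa <- read.csv("satgpa.csv")
  satgpa$sex <- factor(satgpa$sex, levels = c(1, 2), labels = c("Male", "Female"))
  print(cor_summary(satgpa, c("sat_sum", "hs_gpa"), "fy_gpa"))
  print(plot_with_r2(satgpa))
}
```

The table from cor_summary() would let me compare sat_sum and hs_gpa side by side without needing a second plot, and the r squared column is simply the square of each Pearson r.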