library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
setwd("C:/Users/njnav/OneDrive/Data 101/Projects")

satgpa <- read_csv("satgpa.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): sex, sat_v, sat_m, sat_sum, hs_gpa, fy_gpa
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Do students earn similar average percentiles on the Math and Verbal sections of the SAT, or is there a difference?

Introduction

Many students take the SAT every year in hopes of earning a score they can send to colleges. The SAT is split into two sections: Math and Verbal (Evidence-Based Reading and Writing). I wanted to find out whether students earn a higher percentile in one section or whether their percentiles are about the same in both. The data set I use is called satgpa. It was collected by the Educational Testing Service from an unnamed college, and I obtained it from openintro.org.

This data set has 1000 observations and 6 variables. The two I will use are sat_v and sat_m, which will let me determine which section students perform better in.

Data Analysis

The variables I will be using are sat_v and sat_m.

  1. sat_v: This is the Verbal SAT percentile the student received

  2. sat_m: This is the Math SAT percentile the student received

First I check the head and structure of the data set. I then check each column for NAs that I might need to deal with; there are none. I also notice that the gender of each student is coded as 1 and 2, which is hard to read, so I mutate the column to show F for female and M for male. After that, the data set is ready to use.

head(satgpa)
## # A tibble: 6 × 6
##     sex sat_v sat_m sat_sum hs_gpa fy_gpa
##   <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     1    65    62     127   3.4    3.18
## 2     2    58    64     122   4      3.33
## 3     2    56    60     116   3.75   3.25
## 4     1    42    53      95   3.75   2.42
## 5     1    55    52     107   4      2.63
## 6     2    55    56     111   4      2.91
str(satgpa)
## spc_tbl_ [1,000 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ sex    : num [1:1000] 1 2 2 1 1 2 1 1 2 1 ...
##  $ sat_v  : num [1:1000] 65 58 56 42 55 55 57 53 67 41 ...
##  $ sat_m  : num [1:1000] 62 64 60 53 52 56 65 62 77 44 ...
##  $ sat_sum: num [1:1000] 127 122 116 95 107 111 122 115 144 85 ...
##  $ hs_gpa : num [1:1000] 3.4 4 3.75 3.75 4 4 2.8 3.8 4 2.6 ...
##  $ fy_gpa : num [1:1000] 3.18 3.33 3.25 2.42 2.63 2.91 2.83 2.51 3.82 2.54 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   sex = col_double(),
##   ..   sat_v = col_double(),
##   ..   sat_m = col_double(),
##   ..   sat_sum = col_double(),
##   ..   hs_gpa = col_double(),
##   ..   fy_gpa = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
colSums(is.na(satgpa))
##     sex   sat_v   sat_m sat_sum  hs_gpa  fy_gpa 
##       0       0       0       0       0       0
satgpa <- satgpa |>
  mutate(sex = case_when(
    sex == 1 ~ "F",
    sex == 2 ~ "M"
  ))
head(satgpa)
## # A tibble: 6 × 6
##   sex   sat_v sat_m sat_sum hs_gpa fy_gpa
##   <chr> <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1 F        65    62     127   3.4    3.18
## 2 M        58    64     122   4      3.33
## 3 M        56    60     116   3.75   3.25
## 4 F        42    53      95   3.75   2.42
## 5 F        55    52     107   4      2.63
## 6 M        55    56     111   4      2.91

Here I create two summary tables called verbal_percent and math_percent. For each one I select the relevant column (sat_v or sat_m) from satgpa and summarize it to get the mean, median, standard deviation, minimum, and maximum percentile. Comparing the two summaries lets me spot any differences between the sections.

verbal_percent <- satgpa |>
  select(sat_v) |>
  summarize(
    Mean_Verbal = mean(sat_v), 
    Median_Verbal = median(sat_v),
    sd_Verbal = sd(sat_v),
    Min_Verbal = min(sat_v),
    Max_Verbal = max(sat_v))

verbal_percent
## # A tibble: 1 × 5
##   Mean_Verbal Median_Verbal sd_Verbal Min_Verbal Max_Verbal
##         <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1        48.9            49      8.23         24         76
math_percent <- satgpa |>
  select(sat_m) |>
  summarize(
    Mean_Math = mean(sat_m), 
    Median_Math = median(sat_m),
    sd_Math = sd(sat_m),
    Min_Math = min(sat_m),
    Max_Math = max(sat_m))

math_percent
## # A tibble: 1 × 5
##   Mean_Math Median_Math sd_Math Min_Math Max_Math
##       <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
## 1      54.4          55    8.45       29       77
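
The two summaries above could also be produced in a single table by reshaping the data to long format first. The following is a minimal sketch, not the approach used in this report: the `scores` tibble is a hypothetical stand-in built from the six rows printed by `head(satgpa)`, and the names `scores`, `section`, `percentile`, and `section_summary` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for satgpa: only the six rows shown by head() above
scores <- tibble(
  sat_v = c(65, 58, 56, 42, 55, 55),
  sat_m = c(62, 64, 60, 53, 52, 56)
)

# Stack both columns into one, then summarize each section in one pass
section_summary <- scores |>
  pivot_longer(c(sat_v, sat_m), names_to = "section", values_to = "percentile") |>
  group_by(section) |>
  summarize(
    Mean   = mean(percentile),
    Median = median(percentile),
    SD     = sd(percentile),
    Min    = min(percentile),
    Max    = max(percentile)
  )

section_summary
```

Swapping `scores` for the full `satgpa` tibble should give one row per section with the same statistics as the two separate tables.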

After summarizing both sections, I made a histogram of each one to see the distribution of percentiles the students received.

ggplot(satgpa, aes(x = sat_v)) +
  geom_histogram(binwidth = 2, fill = "purple", color = "black") +
  labs(title = "Histogram of Verbal Percentiles", x = "Verbal Percentile", y = "Number of Students") +
  theme_minimal()

ggplot(satgpa, aes(x = sat_m)) +
  geom_histogram(binwidth = 2, fill = "pink", color = "black") +
  labs(title = "Histogram of Math Percentiles", x = "Math Percentile", y = "Number of Students") +
  theme_minimal()
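
To make the two distributions easier to compare, the histograms could share one x-axis by stacking them as facets. This is a sketch under assumptions: `scores` is again a hypothetical stand-in holding only the six rows printed by `head(satgpa)`, and `scores_long` and `section_hist` are illustrative names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for satgpa: only the six rows shown by head() above
scores <- tibble(
  sat_v = c(65, 58, 56, 42, 55, 55),
  sat_m = c(62, 64, 60, 53, 52, 56)
)

# Long format: one row per student per section, with readable section labels
scores_long <- scores |>
  pivot_longer(c(sat_v, sat_m), names_to = "section", values_to = "percentile") |>
  mutate(section = if_else(section == "sat_v", "Verbal", "Math"))

# One panel per section, stacked vertically so the x-axes line up
section_hist <- ggplot(scores_long, aes(x = percentile, fill = section)) +
  geom_histogram(binwidth = 2, color = "black") +
  facet_wrap(~ section, ncol = 1) +
  scale_fill_manual(values = c(Verbal = "purple", Math = "pink")) +
  labs(title = "Histograms of SAT Percentiles by Section",
       x = "Percentile", y = "Number of Students") +
  theme_minimal() +
  theme(legend.position = "none")

section_hist
```

Because `facet_wrap()` keeps the axis scales fixed by default, a shift between the two sections shows up as a horizontal offset between the panels.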

Statistical Analysis

Do students earn similar average percentiles on the Math and Verbal sections of the SAT, or is there a difference?

Hypothesis

\(H_0: \mu_m = \mu_v\)

\(H_a: \mu_m \neq \mu_v\)

Where:

\(\mu_m\) = Average Math percentile scored by students

\(\mu_v\) = Average Verbal percentile scored by students

Significance level

α = 0.05

t.test(satgpa$sat_m, satgpa$sat_v, alternative = "two.sided", conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  satgpa$sat_m and satgpa$sat_v
## t = 14.637, df = 1996.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.729299 6.192701
## sample estimates:
## mean of x mean of y 
##    54.395    48.934
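
As a sanity check, the Welch statistic can be rebuilt by hand from the summary values reported earlier. The sketch below uses the means from the test output and the standard deviations from the summary tables; because those standard deviations are rounded to two decimals, the results should only approximately match the reported t = 14.637, df = 1996.7, and confidence interval. The variable names are illustrative.

```r
# Summary values taken from the output above (standard deviations are rounded)
n      <- 1000
mean_m <- 54.395; sd_m <- 8.45
mean_v <- 48.934; sd_v <- 8.23

# Variance of each sample mean
var_m <- sd_m^2 / n
var_v <- sd_v^2 / n

# Welch t statistic: difference in means divided by its standard error
se     <- sqrt(var_m + var_v)
t_stat <- (mean_m - mean_v) / se

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (var_m + var_v)^2 / (var_m^2 / (n - 1) + var_v^2 / (n - 1))

# 95% confidence interval for the difference in means
ci <- (mean_m - mean_v) + c(-1, 1) * qt(0.975, welch_df) * se

round(c(t = t_stat, df = welch_df, lower = ci[1], upper = ci[2]), 2)
```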

Because the p-value is less than 2.2e-16, which is far below 0.05, we reject the null hypothesis. We are 95% confident that the average Math percentile is between 4.73 and 6.19 percentile points higher than the average Verbal percentile. The confidence interval does not include 0, which agrees with the result of the test. At the 0.05 significance level, we have enough evidence that the average Math percentile scored by students is not equal to the average Verbal percentile scored by students.
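
Each student in satgpa has both a Math and a Verbal percentile, so the two samples are not independent. A paired t-test, which works on the per-student differences, is therefore a natural alternative to the Welch test above. The sketch below only illustrates the call: `scores` is a hypothetical stand-in holding the six rows printed by `head(satgpa)`, so its output says nothing about the full data set.

```r
# Hypothetical stand-in for satgpa: only the six rows shown by head() above
scores <- data.frame(
  sat_v = c(65, 58, 56, 42, 55, 55),
  sat_m = c(62, 64, 60, 53, 52, 56)
)

# Paired test: equivalent to a one-sample t-test on (sat_m - sat_v)
paired_result <- t.test(scores$sat_m, scores$sat_v,
                        paired = TRUE,
                        alternative = "two.sided",
                        conf.level = 0.95)

paired_result
```

On the real data the call would be `t.test(satgpa$sat_m, satgpa$sat_v, paired = TRUE)`. The estimated mean difference would equal the difference in means from the Welch test, but the standard error, degrees of freedom, and confidence interval would generally differ.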

Conclusion and Future Directions

Using the satgpa data, I can say that there is a difference in the average percentile students scored on the Math and Verbal portions of the SAT, as shown by the statistical analysis. Students also tended to earn higher percentiles in Math: both the mean and the median are higher than for the Verbal section. This can be seen in the histograms as well, where the Math data peaks around 55 while the Verbal data peaks around 47. With a larger sample in the future, we could estimate the size of the difference between the two sections more precisely. It could also be interesting to look into whether one gender scores higher percentiles than the other.

References

satgpa data set, collected by the Educational Testing Service. Retrieved from openintro.org at https://www.openintro.org/data/index.php?data=satgpa