Ilina Maksimovska

Research question: Is there a difference in state spending on public education per student between U.S. states whose average full SAT score (SAT Verbal + SAT Math) is above the median and those whose average full SAT score is below the median?

Assumptions that need to be met:

data()
data(package = .packages(all.available = TRUE))
#install.packages("carData")
library(carData) #Activating the package carData
## Warning: package 'carData' was built under R version 4.3.2
mydata <- force(States) #Importing a data set called "States" and creating a data frame
head(mydata) #Showing first 6 rows of the newly created data frame
##    region   pop SATV SATM percent dollars pay
## AL    ESC  4041  470  514       8   3.648  27
## AK    PAC   550  438  476      42   7.887  43
## AZ    MTN  3665  445  497      25   4.231  30
## AR    WSC  2351  470  511       6   3.334  23
## CA    PAC 29760  419  484      45   4.826  39
## CO    MTN  3294  456  513      28   4.809  31

Explanation of the data set:

The States data frame contains 51 observations of 7 variables. The observations are the U.S. states and Washington, D.C.

Source of data set: United States (1992) Statistical Abstract of the United States. Bureau of the Census.

Explanation of the variables in the States data set:

colnames(mydata) <- c("Region", "Population", "SATVerbal", "SATMath", "Percent", "StateSpending", "Salary") #Renaming variables in data frame mydata
head(mydata)
##    Region Population SATVerbal SATMath Percent StateSpending Salary
## AL    ESC       4041       470     514       8         3.648     27
## AK    PAC        550       438     476      42         7.887     43
## AZ    MTN       3665       445     497      25         4.231     30
## AR    WSC       2351       470     511       6         3.334     23
## CA    PAC      29760       419     484      45         4.826     39
## CO    MTN       3294       456     513      28         4.809     31
mydata$FullSATScore <- (mydata$SATVerbal + mydata$SATMath)  #Creating new variable that shows the average full SAT score (SAT Verbal + SAT Math) of each state
head(mydata)
##    Region Population SATVerbal SATMath Percent StateSpending Salary
## AL    ESC       4041       470     514       8         3.648     27
## AK    PAC        550       438     476      42         7.887     43
## AZ    MTN       3665       445     497      25         4.231     30
## AR    WSC       2351       470     511       6         3.334     23
## CA    PAC      29760       419     484      45         4.826     39
## CO    MTN       3294       456     513      28         4.809     31
##    FullSATScore
## AL          984
## AK          914
## AZ          942
## AR          981
## CA          903
## CO          969
summary(mydata) #Descriptive statistics for all variables
##      Region     Population      SATVerbal        SATMath     
##  SA     : 9   Min.   :  454   Min.   :397.0   Min.   :437.0  
##  MTN    : 8   1st Qu.: 1215   1st Qu.:422.5   1st Qu.:470.0  
##  WNC    : 7   Median : 3294   Median :443.0   Median :490.0  
##  NE     : 6   Mean   : 4877   Mean   :448.2   Mean   :497.4  
##  ENC    : 5   3rd Qu.: 5780   3rd Qu.:474.5   3rd Qu.:522.5  
##  PAC    : 5   Max.   :29760   Max.   :511.0   Max.   :577.0  
##  (Other):11                                                  
##     Percent      StateSpending       Salary       FullSATScore   
##  Min.   : 4.00   Min.   :2.993   Min.   :22.00   Min.   : 834.0  
##  1st Qu.:11.50   1st Qu.:4.354   1st Qu.:27.50   1st Qu.: 893.0  
##  Median :25.00   Median :5.045   Median :30.00   Median : 933.0  
##  Mean   :33.75   Mean   :5.175   Mean   :30.94   Mean   : 945.5  
##  3rd Qu.:57.50   3rd Qu.:5.689   3rd Qu.:33.50   3rd Qu.: 994.5  
##  Max.   :74.00   Max.   :9.159   Max.   :43.00   Max.   :1088.0  
## 
library(psych)

describe(mydata) #More detailed descriptive statistics (psych package)
##               vars  n    mean      sd  median trimmed     mad    min
## Region*          1 51    5.27    2.45    5.00    5.37    2.97   1.00
## Population       2 51 4876.65 5439.20 3294.00 3813.15 3239.48 454.00
## SATVerbal        3 51  448.16   30.82  443.00  447.29   37.06 397.00
## SATMath          4 51  497.39   34.57  490.00  496.51   40.03 437.00
## Percent          5 51   33.75   24.07   25.00   32.76   28.17   4.00
## StateSpending    6 51    5.18    1.38    5.04    5.02    1.03   2.99
## Salary           7 51   30.94    5.31   30.00   30.63    4.45  22.00
## FullSATScore     8 51  945.55   64.77  933.00  943.85   74.13 834.00
##                    max    range  skew kurtosis     se
## Region*           9.00     8.00 -0.25    -1.11   0.34
## Population    29760.00 29306.00  2.41     7.06 761.64
## SATVerbal       511.00   114.00  0.18    -1.11   4.32
## SATMath         577.00   140.00  0.23    -0.86   4.84
## Percent          74.00    70.00  0.22    -1.63   3.37
## StateSpending     9.16     6.17  0.97     0.71   0.19
## Salary           43.00    21.00  0.51    -0.49   0.74
## FullSATScore   1088.00   254.00  0.22    -1.00   9.07

Interpretation of the descriptive statistics for data frame mydata

mydata$FullSATScoreFactor <- factor(ifelse(mydata$FullSATScore < median(mydata$FullSATScore), 0, 1),
                                      levels = c(0, 1), 
                                      labels = c("Below Median", "Above Median")) #Creating a factor that splits the U.S. states at the median full SAT score (933); states scoring exactly 933 fall into "Above Median"
library(psych)
describeBy(mydata$StateSpending, g = mydata$FullSATScoreFactor)
## 
##  Descriptive statistics by group 
## group: Below Median
##    vars  n mean   sd median trimmed  mad  min  max range skew
## X1    1 25 5.98 1.39    5.5    5.87 1.01 4.24 9.16  4.92  0.8
##    kurtosis   se
## X1     -0.6 0.28
## ---------------------------------------------------- 
## group: Above Median
##    vars  n mean   sd median trimmed  mad  min  max range  skew
## X1    1 26 4.41 0.82    4.4     4.4 1.01 2.99 5.95  2.95 -0.01
##    kurtosis   se
## X1    -1.23 0.16
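
The split above uses a strict `<` comparison, so any state whose score equals the median lands in the "Above Median" group, which is why the groups have 25 and 26 states rather than equal sizes. A minimal sketch of that behaviour, using a small vector of made-up scores (`scores`, `cutoff`, `group`, and `counts` are hypothetical names, not part of the analysis):

```r
# Hypothetical full SAT scores; 933 is the median of these five values
scores <- c(880, 900, 933, 950, 1000)

# Compute the cutoff from the data instead of hard-coding it
cutoff <- median(scores)

# Strict "<": a score equal to the median is labelled "Above Median"
group <- factor(ifelse(scores < cutoff, "Below Median", "Above Median"),
                levels = c("Below Median", "Above Median"))

# Group sizes: the median observation is counted in "Above Median"
counts <- table(group)
print(counts)
```

With an odd number of observations the two groups can never be the same size, so the choice of `<` versus `<=` only decides which group receives the median observation.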

Interpretation of the descriptive statistics

Group 1: Below median

Group 2: Above median

Since StateSpending is a numerical variable, the first assumption is not violated. Next, we check the second assumption, that StateSpending is normally distributed within each group, using a histogram and the Shapiro-Wilk test.

library(ggplot2) 
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = StateSpending)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  facet_wrap(~FullSATScoreFactor, ncol = 1) + 
  ylab("Frequency") #Creating histogram to show the distribution

library(rstatix)
## Warning: package 'rstatix' was built under R version 4.3.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
mydata %>%
  group_by(FullSATScoreFactor) %>%
  shapiro_test(StateSpending) #Shapiro-Wilk test to check normality of the data
## # A tibble: 2 × 4
##   FullSATScoreFactor variable      statistic      p
##   <fct>              <chr>             <dbl>  <dbl>
## 1 Below Median       StateSpending     0.898 0.0165
## 2 Above Median       StateSpending     0.959 0.378

Below Median

H0: Variable is normally distributed

H1: Variable is not normally distributed

We reject the null hypothesis (p = 0.017) and conclude that the variable StateSpending is not normally distributed among the states with full SAT scores below the median.

Above Median

H0: Variable is normally distributed

H1: Variable is not normally distributed

We fail to reject the null hypothesis (p = 0.38), so we cannot claim that the variable StateSpending deviates from a normal distribution among the states with full SAT scores above the median.

Since the second assumption is violated in one of the two groups, I have to use a nonparametric test. However, I will first run the parametric test, without interpreting it, in order to show the steps.

Parametric test (Independent t-test with Welch correction)

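t.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
         var.equal = FALSE,
         alternative = "two.sided") #Welch test for two independent samples; paired is omitted because recent versions of R (4.4.0 onward) reject it in the formula interface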
## 
##  Welch Two Sample t-test
## 
## data:  mydata$StateSpending by mydata$FullSATScoreFactor
## t = 4.8883, df = 38.449, p-value = 1.835e-05
## alternative hypothesis: true difference in means between group Below Median and group Above Median is not equal to 0
## 95 percent confidence interval:
##  0.9205684 2.2211485
## sample estimates:
## mean in group Below Median mean in group Above Median 
##                   5.976320                   4.405462
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
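effectsize::cohens_d(mydata$StateSpending ~ mydata$FullSATScoreFactor) #pooled_sd keeps its default (TRUE), matching the "Estimated using pooled SD" note in the output; pass pooled_sd = FALSE for an un-pooled estimate in line with the Welch test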
## Cohen's d |       95% CI
## ------------------------
## 1.38      | [0.76, 1.99]
## 
## - Estimated using pooled SD.
interpret_cohens_d(1.38, rules = "sawilowsky2009")
## [1] "very large"
## (Rules: sawilowsky2009)
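
To make the Welch output above less of a black box, the test statistic and the Welch-Satterthwaite degrees of freedom can be recomputed by hand. The sketch below is only an illustration: `welch_by_hand` is a hypothetical helper, and `x` and `y` are simulated samples, not the States data.

```r
# Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(t = t_stat, df = df)
}

# Simulated data (assumption: chosen only to resemble two groups of 25 and 26)
set.seed(1)
x <- rnorm(25, mean = 6.0, sd = 1.4)
y <- rnorm(26, mean = 4.4, sd = 0.8)

manual <- welch_by_hand(x, y)
builtin <- t.test(x, y, var.equal = FALSE)

# The hand computation and t.test() agree on both quantities
print(manual)
print(c(t = unname(builtin$statistic), df = unname(builtin$parameter)))
```

Applied to StateSpending in the two groups of mydata, the same helper should reproduce the t and df values reported by t.test() above.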

Nonparametric test (Wilcoxon Rank Sum Test)

wilcox.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided") #Two independent samples; paired is left at its default (FALSE)
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$StateSpending by mydata$FullSATScoreFactor
## W = 541, p-value = 4.703e-05
## alternative hypothesis: true location shift is not equal to 0

H0: Distribution location of state spending is the same for the U.S. States with average SAT scores above the median and the ones below the median.

H1: Distribution location of state spending is different for the U.S. States with average SAT scores above the median and the ones below the median.

We reject the null hypothesis (p < 0.001) and conclude that the distribution location of state spending differs between the U.S. states with full SAT scores above the median and those below the median.

library(effectsize)
effectsize(wilcox.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |       95% CI
## --------------------------------
## 0.66              | [0.45, 0.81]
interpret_rank_biserial(0.66)
## [1] "very large"
## (Rules: funder2019)
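
The rank-biserial value above can be checked against the Wilcoxon output. Assuming the standard relation between the Mann-Whitney statistic W and the rank-biserial correlation, r = 2W / (n1 * n2) - 1, a minimal sketch looks like this (`rank_biserial_from_w` is a hypothetical helper, not a function from the effectsize package):

```r
# Rank-biserial correlation from the Mann-Whitney / Wilcoxon W statistic,
# where W counts the pairs in which the first group exceeds the second
rank_biserial_from_w <- function(w, n1, n2) {
  2 * w / (n1 * n2) - 1
}

# W = 541 from wilcox.test(); group sizes 25 and 26 from describeBy()
r <- rank_biserial_from_w(w = 541, n1 = 25, n2 = 26)
print(round(r, 2))  # 0.66
```

Here W / (n1 * n2) is about 0.83, meaning that in roughly 83% of all below-median/above-median pairs of states, the below-median state spends more.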

Based on the sample data, we find that U.S. states with average full SAT scores (SAT Verbal + SAT Math) above the median differ from those with average full SAT scores below the median in the amount of money spent on public education (p < 0.001). States with average full SAT scores below the median spend more, and the difference in distribution location is very large (r = 0.66). However, this could be attributed to the fact that the states with full SAT scores above the median have lower percentages of graduating high-school students who took the SAT exam than the states with full SAT scores below the median.