Assignment 1

Ilina Maksimovska

Research question: Is there a difference in state spending on public education per student between U.S. states whose average full SAT score (SAT Verbal + SAT Math) is above the median and those whose average full SAT score is below the median?

Assumptions that need to be met:

The variable is numeric.
The variable is normally distributed in both populations.
The data come from two independent populations.
The variable has the same variance in both populations. Since this assumption is often violated, we apply the Welch correction.

data()
data(package = .packages(all.available = TRUE))

#install.packages("carData")
library(carData) #Activating the package carData

## Warning: package 'carData' was built under R version 4.3.2

mydata <- force(States) #Importing a data set called "States" and creating a data frame
head(mydata) #Showing first 6 rows of the newly created data frame

##    region   pop SATV SATM percent dollars pay
## AL    ESC  4041  470  514       8   3.648  27
## AK    PAC   550  438  476      42   7.887  43
## AZ    MTN  3665  445  497      25   4.231  30
## AR    WSC  2351  470  511       6   3.334  23
## CA    PAC 29760  419  484      45   4.826  39
## CO    MTN  3294  456  513      28   4.809  31

Explanation of a data set:

The States data frame contains 51 observations of 7 variables. The observations are the U. S. states and Washington, D. C.

Source of data set: United States (1992) Statistical Abstract of the United States. Bureau of the Census.

Explanation of the variables in the States data set:

region: U.S. Census regions.
pop: The population of each state, measured in thousands (1,000s).
SATV: This represents the average score achieved by graduating high-school students in each state on the verbal component of the Scholastic Aptitude Test (SAT), a widely-recognized university admission exam.
SATM: The average score of graduating high-school students in each state on the math component of the Scholastic Aptitude Test (SAT).
percent: The percentage of graduating high-school students in each state who took the SAT exam.
dollars: State spending on public education, reported in thousands of dollars per student.
pay: The average salary of teachers in each state, measured in thousands of dollars.

colnames(mydata) <- c("Region", "Population", "SATVerbal", "SATMath", "Percent", "StateSpending", "Salary") #Renaming variables in data frame mydata
head(mydata)

##    Region Population SATVerbal SATMath Percent StateSpending Salary
## AL    ESC       4041       470     514       8         3.648     27
## AK    PAC        550       438     476      42         7.887     43
## AZ    MTN       3665       445     497      25         4.231     30
## AR    WSC       2351       470     511       6         3.334     23
## CA    PAC      29760       419     484      45         4.826     39
## CO    MTN       3294       456     513      28         4.809     31

mydata$FullSATScore <- (mydata$SATVerbal + mydata$SATMath)  #Creating new variable that shows the average full SAT score (SAT Verbal + SAT Math) of each state
head(mydata)

##    Region Population SATVerbal SATMath Percent StateSpending Salary
## AL    ESC       4041       470     514       8         3.648     27
## AK    PAC        550       438     476      42         7.887     43
## AZ    MTN       3665       445     497      25         4.231     30
## AR    WSC       2351       470     511       6         3.334     23
## CA    PAC      29760       419     484      45         4.826     39
## CO    MTN       3294       456     513      28         4.809     31
##    FullSATScore
## AL          984
## AK          914
## AZ          942
## AR          981
## CA          903
## CO          969

summary(mydata) #Descriptive statistics (interpreted below)

##      Region     Population      SATVerbal        SATMath     
##  SA     : 9   Min.   :  454   Min.   :397.0   Min.   :437.0  
##  MTN    : 8   1st Qu.: 1215   1st Qu.:422.5   1st Qu.:470.0  
##  WNC    : 7   Median : 3294   Median :443.0   Median :490.0  
##  NE     : 6   Mean   : 4877   Mean   :448.2   Mean   :497.4  
##  ENC    : 5   3rd Qu.: 5780   3rd Qu.:474.5   3rd Qu.:522.5  
##  PAC    : 5   Max.   :29760   Max.   :511.0   Max.   :577.0  
##  (Other):11                                                  
##     Percent      StateSpending       Salary       FullSATScore   
##  Min.   : 4.00   Min.   :2.993   Min.   :22.00   Min.   : 834.0  
##  1st Qu.:11.50   1st Qu.:4.354   1st Qu.:27.50   1st Qu.: 893.0  
##  Median :25.00   Median :5.045   Median :30.00   Median : 933.0  
##  Mean   :33.75   Mean   :5.175   Mean   :30.94   Mean   : 945.5  
##  3rd Qu.:57.50   3rd Qu.:5.689   3rd Qu.:33.50   3rd Qu.: 994.5  
##  Max.   :74.00   Max.   :9.159   Max.   :43.00   Max.   :1088.0  
##

library(psych)

describe(mydata) #Extended descriptive statistics (interpreted below)

##               vars  n    mean      sd  median trimmed     mad    min
## Region*          1 51    5.27    2.45    5.00    5.37    2.97   1.00
## Population       2 51 4876.65 5439.20 3294.00 3813.15 3239.48 454.00
## SATVerbal        3 51  448.16   30.82  443.00  447.29   37.06 397.00
## SATMath          4 51  497.39   34.57  490.00  496.51   40.03 437.00
## Percent          5 51   33.75   24.07   25.00   32.76   28.17   4.00
## StateSpending    6 51    5.18    1.38    5.04    5.02    1.03   2.99
## Salary           7 51   30.94    5.31   30.00   30.63    4.45  22.00
## FullSATScore     8 51  945.55   64.77  933.00  943.85   74.13 834.00
##                    max    range  skew kurtosis     se
## Region*           9.00     8.00 -0.25    -1.11   0.34
## Population    29760.00 29306.00  2.41     7.06 761.64
## SATVerbal       511.00   114.00  0.18    -1.11   4.32
## SATMath         577.00   140.00  0.23    -0.86   4.84
## Percent          74.00    70.00  0.22    -1.63   3.37
## StateSpending     9.16     6.17  0.97     0.71   0.19
## Salary           43.00    21.00  0.51    -0.49   0.74
## FullSATScore   1088.00   254.00  0.22    -1.00   9.07

Interpretation of the descriptive statistics for data frame mydata

Mean for StateSpending: The mean value for StateSpending is approximately 5.18. This means that, on average, states spend around $5,180 per student on public education.
Median for FullSATScore: The median value for FullSATScore is 933. Half of the states have a full SAT score (SAT Verbal + SAT Math) at or below this figure, while the other half have a full SAT score at or above it.
Max. for Percent: The maximum percentage of graduating high-school students per state who took the SAT exam is 74.
IQR for Population: 5780 (Q3) - 1215 (Q1) = 4565. The middle 50% of the states have populations between 1,215,000 and 5,780,000 people, a spread of 4,565,000 people.
Min. for Salary: The minimum average teacher’s salary among the states is $22,000.
1st Qu. for SATVerbal: Q1 of 422.5 indicates that 25% of the states had an average SAT verbal score of 422.5 or below, while 75% had a higher average score.
3rd Qu. for Population: Q3 of 5780 indicates that 75% of the states have a population of 5,780,000 or below, while 25% have a larger population.
Skewness of Population: A skewness value of 2.41 indicates that the distribution of the “Population” variable is positively skewed (asymmetrical to the right).
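
The IQR arithmetic above can be reproduced directly from the quartiles printed by summary(). The following is a minimal sketch that uses only the two reported quartiles; the variable names are illustrative and not part of the original analysis.

```r
# Quartiles of Population (in thousands) as printed by summary(mydata)
q1 <- 1215
q3 <- 5780

# Interquartile range in thousands and in people
iqr_thousands <- q3 - q1
iqr_people <- iqr_thousands * 1000

iqr_thousands
iqr_people
```

With the full data frame loaded, IQR(mydata$Population) should give the same figure in thousands, since summary() and IQR() use the same default quantile definition.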

mydata$FullSATScoreFactor <- factor(ifelse(mydata$FullSATScore < median(mydata$FullSATScore), 0, 1),
                                    levels = c(0, 1),
                                    labels = c("Below Median", "Above Median")) #Creating a factor that splits the U.S. states at the median full SAT score (933); states scoring exactly 933 fall into "Above Median"
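
The split rule can be illustrated on a small scale. The sketch below uses the six full SAT scores printed by head(mydata) plus one hypothetical state placed exactly at the reported median of 933 (that seventh value is an assumption added for illustration). It shows that a strict "<" comparison sends a state sitting on the median into the "Above Median" group.

```r
# Full SAT scores of the six states printed by head(mydata), plus one
# hypothetical state placed exactly at the reported median of 933
scores <- c(AL = 984, AK = 914, AZ = 942, AR = 981, CA = 903, CO = 969,
            hypothetical = 933)
cutoff <- 933

# Same coding rule as in the report: strictly below the cutoff -> 0, else 1
group <- factor(ifelse(scores < cutoff, 0, 1),
                levels = c(0, 1),
                labels = c("Below Median", "Above Median"))

data.frame(score = scores, group = group)
```

This is why the two groups have 25 and 26 states rather than an even split: with 51 observations, the median is itself an observed value.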

library(psych)
describeBy(mydata$StateSpending, g = mydata$FullSATScoreFactor)

## 
##  Descriptive statistics by group 
## group: Below Median
##    vars  n mean   sd median trimmed  mad  min  max range skew
## X1    1 25 5.98 1.39    5.5    5.87 1.01 4.24 9.16  4.92  0.8
##    kurtosis   se
## X1     -0.6 0.28
## ---------------------------------------------------- 
## group: Above Median
##    vars  n mean   sd median trimmed  mad  min  max range  skew
## X1    1 26 4.41 0.82    4.4     4.4 1.01 2.99 5.95  2.95 -0.01
##    kurtosis   se
## X1    -1.23 0.16

Interpretation of the descriptive statistics

Group 1: Below median

Mean: The mean value for StateSpending in the Below Median group is approximately 5.98. This means that, on average, the states in this group spend around $5,980 per student on public education.
Median: The median value for StateSpending in the Below Median group is approximately 5.5. Half of the states in this group spend less than this figure ($5,500 per student), while the other half spend more.
Skewness: A skewness value of 0.8 indicates that the distribution of the StateSpending variable in the Below Median group is positively skewed (asymmetrical to the right).

Group 2: Above median

Mean: The mean value for StateSpending in the Above Median group is approximately 4.41. This means that, on average, the states in this group spend around $4,410 per student on public education.
Median: The median value for StateSpending in the Above Median group is approximately 4.4. Half of the states in this group spend less than this figure ($4,400 per student), while the other half spend more.
Skewness: A skewness value of -0.01 indicates that the distribution of the StateSpending variable in the Above Median group is approximately symmetric.

Since StateSpending is a numeric variable, the first assumption is not violated. Next, we check the second assumption (normality) in each group with a histogram and the Shapiro-Wilk test.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x = StateSpending)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  facet_wrap(~FullSATScoreFactor, ncol = 1) + 
  ylab("Frequency") #Creating histogram to show the distribution

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.3.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(FullSATScoreFactor) %>%
  shapiro_test(StateSpending) #Shapiro Wilk test to check normality of the data

## # A tibble: 2 × 4
##   FullSATScoreFactor variable      statistic      p
##   <fct>              <chr>             <dbl>  <dbl>
## 1 Below Median       StateSpending     0.898 0.0165
## 2 Above Median       StateSpending     0.959 0.378

Below Median

H0: Variable is normally distributed

H1: Variable is not normally distributed

We reject the null hypothesis (p = 0.017) and conclude that the variable StateSpending is not normally distributed among the states with full SAT scores below the median.

Above Median

H0: Variable is normally distributed

H1: Variable is not normally distributed

We fail to reject the null hypothesis (p = 0.38), so there is no evidence against normality of the variable StateSpending among the states with full SAT scores above the median.

Since the second assumption is violated in the Below Median group, a nonparametric test is required. Nevertheless, I first run the parametric test, without interpreting it, in order to show the steps.
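
The decision logic above can be written out as a short sketch. It takes the two Shapiro-Wilk p-values from the output and applies a significance level of 0.05; that level is an assumption, since the report does not state it explicitly, and the variable names are illustrative.

```r
# Shapiro-Wilk p-values copied from the shapiro_test() output above
p_values <- c("Below Median" = 0.0165, "Above Median" = 0.378)
alpha <- 0.05  # assumed significance level (not stated in the text)

# Per-group decision about the normality null hypothesis
decision <- ifelse(p_values < alpha,
                   "reject H0 (normality)",
                   "fail to reject H0 (normality)")
decision

# The normality assumption is treated as violated if any group rejects H0
use_nonparametric <- any(p_values < alpha)
use_nonparametric
```

One rejected group is enough to switch tests, because the independent t-test assumes normality in both populations.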

Parametric test (Independent t-test with Welch correction)

t.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
         paired= FALSE,
         var.equal = FALSE,
         alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata$StateSpending by mydata$FullSATScoreFactor
## t = 4.8883, df = 38.449, p-value = 1.835e-05
## alternative hypothesis: true difference in means between group Below Median and group Above Median is not equal to 0
## 95 percent confidence interval:
##  0.9205684 2.2211485
## sample estimates:
## mean in group Below Median mean in group Above Median 
##                   5.976320                   4.405462
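
As a cross-check, the Welch statistic and its Satterthwaite degrees of freedom can be approximated by hand from the group summaries. This is a sketch only: the means come from the t-test output, while the standard deviations are the rounded values from describeBy(), so the results should be close to, but not identical with, the reported t = 4.8883 and df = 38.449.

```r
# Group summaries copied from the outputs above (sd values are rounded)
m1 <- 5.976320; s1 <- 1.39; n1 <- 25   # Below Median
m2 <- 4.405462; s2 <- 0.82; n2 <- 26   # Above Median

# Squared standard errors of the two group means
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch <- (m1 - m2) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)

c(t = t_welch, df = df_welch, p = p_welch)
```

The non-integer degrees of freedom are the visible effect of the Welch correction: the two variances are not pooled.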

library (effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

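effectsize::cohens_d(mydata$StateSpending ~ mydata$FullSATScoreFactor,
                     pooled_sd = FALSE) #pooled_sd = FALSE requests the un-pooled SD, in line with the Welch test; the output below was estimated using the pooled SD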

## Cohen's d |       95% CI
## ------------------------
## 1.38      | [0.76, 1.99]
## 
## - Estimated using pooled SD.

interpret_cohens_d (1.38, rules = "sawilowsky2009")

## [1] "very large"
## (Rules: sawilowsky2009)
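
To see how much the choice of standardizer matters here, Cohen's d can be approximated from the group summaries with both a pooled and an un-pooled standard deviation. This is a sketch under assumptions: the standard deviations are the rounded values from describeBy(), and the un-pooled version uses the common definition (square root of the average of the two variances), which may differ in detail from what a given package implements.

```r
# Group summaries copied from the outputs above (sd values are rounded)
m1 <- 5.976320; s1 <- 1.39; n1 <- 25   # Below Median
m2 <- 4.405462; s2 <- 0.82; n2 <- 26   # Above Median

# Pooled SD weights each variance by its degrees of freedom
sd_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Un-pooled SD averages the two variances with equal weight
sd_unpooled <- sqrt((s1^2 + s2^2) / 2)

d_pooled <- (m1 - m2) / sd_pooled
d_unpooled <- (m1 - m2) / sd_unpooled

round(c(pooled = d_pooled, unpooled = d_unpooled), 2)
```

Because the two groups are almost equal in size, both versions land very close to the reported d of 1.38.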

Nonparametric test (Wilcoxon Rank Sum Test)

wilcox.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$StateSpending by mydata$FullSATScoreFactor
## W = 541, p-value = 4.703e-05
## alternative hypothesis: true location shift is not equal to 0

H0: The distribution location of state spending is the same for the U.S. states with full SAT scores above the median and the ones below the median.

H1: The distribution location of state spending is different for the U.S. states with full SAT scores above the median and the ones below the median.

We reject the null hypothesis (p < 0.001) and conclude that the distribution location of state spending is different for the U.S. states with full SAT scores above the median and the ones below the median.

library(effectsize)
effectsize(wilcox.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.66              | [0.45, 0.81]

interpret_rank_biserial(0.66)

## [1] "very large"
## (Rules: funder2019)
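
The rank-biserial correlation can also be recovered by hand from the Wilcoxon output. The sketch below uses the standard relation r = 2W / (n1 * n2) - 1, where W is the statistic reported by wilcox.test() and n1, n2 are the group sizes from describeBy(); the variable names are illustrative.

```r
# Values copied from the outputs above
W  <- 541   # Wilcoxon rank sum statistic
n1 <- 25    # Below Median group size
n2 <- 26    # Above Median group size

# W / (n1 * n2) is the share of cross-group pairs in which the
# Below Median state spends more; rescaling maps it to [-1, 1]
r_rb <- 2 * W / (n1 * n2) - 1

round(r_rb, 2)
```

Rounded to two decimals this matches the 0.66 reported by effectsize(). The positive sign reflects that the first factor level, Below Median, tends to have the higher spending.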

Based on the sample data, we find that U.S. states with average full SAT scores (SAT Verbal + SAT Math) above the median differ in per-student spending on public education from those with average full SAT scores below the median (p < 0.001). States with full SAT scores below the median spend more, and the difference in distribution location is very large (r = 0.66). However, this could be attributed to the fact that states with full SAT scores above the median may have lower percentages of graduating high-school students who took the SAT exam than states with full SAT scores below the median.
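
The caveat about SAT participation can be illustrated with the six states printed by head(mydata). This is only a sketch on six rows, not a result for all 51 observations; the commented-out line shows the analogous call on the full data frame, which is not run here.

```r
# Percent of graduates taking the SAT and full SAT score for the six
# states printed by head(mydata)
percent  <- c(AL = 8,   AK = 42,  AZ = 25,  AR = 6,   CA = 45,  CO = 28)
full_sat <- c(AL = 984, AK = 914, AZ = 942, AR = 981, CA = 903, CO = 969)

# Pearson correlation between participation and full SAT score
r_six <- cor(percent, full_sat)
r_six

# Analogous check on the full data frame from the report (not run here):
# cor(mydata$Percent, mydata$FullSATScore)
```

A negative correlation in these rows is consistent with the caveat: where only a small share of students sit the exam, the average score tends to be higher.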