data()
data(package = .packages(all.available = TRUE))
#install.packages("carData")
library(carData) #Activating the package carData
## Warning: package 'carData' was built under R version 4.3.2
mydata <- force(States) #Importing a data set called "States" and creating a data frame
head(mydata) #Showing first 6 rows of the newly created data frame
## region pop SATV SATM percent dollars pay
## AL ESC 4041 470 514 8 3.648 27
## AK PAC 550 438 476 42 7.887 43
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 3.334 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
The States data frame contains 51 observations of 7 variables. The observations are the U.S. states and Washington, D.C.
Source of data set: United States (1992) Statistical Abstract of the United States. Bureau of the Census.
Explanation of the variables in the States data set:
region: U.S. Census regions.
pop: The population of each state, measured in thousands (1,000s).
SATV: This represents the average score achieved by graduating high-school students in each state on the verbal component of the Scholastic Aptitude Test (SAT), a widely-recognized university admission exam.
SATM: The average score of graduating high-school students in each state on the math component of the Scholastic Aptitude Test (SAT).
percent: The percentage of graduating high-school students in each state who took the SAT exam.
dollars: State spending on public education, reported in thousands of dollars per student.
pay: The average salary of teachers in each state, measured in thousands of dollars.
colnames(mydata) <- c("Region", "Population", "SATVerbal", "SATMath", "Percent", "StateSpending", "Salary") #Renaming variables in data frame mydata
head(mydata)
## Region Population SATVerbal SATMath Percent StateSpending Salary
## AL ESC 4041 470 514 8 3.648 27
## AK PAC 550 438 476 42 7.887 43
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 3.334 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
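Assigning colnames() positionally works only as long as the column order stays the same. The following is a minimal sketch of a name-based alternative. It assumes a two-row stand-in for the data (values copied from the head() output above); the objects demo and new_names are illustrative and not part of the original analysis.

```r
# Stand-in for the first two rows of States (values copied from the head() output).
demo <- data.frame(
  region  = c("ESC", "PAC"),
  pop     = c(4041, 550),
  SATV    = c(470, 438),
  SATM    = c(514, 476),
  percent = c(8, 42),
  dollars = c(3.648, 7.887),
  pay     = c(27, 43),
  row.names = c("AL", "AK")
)

# Hypothetical lookup table: old name -> new name.
new_names <- c(region = "Region", pop = "Population", SATV = "SATVerbal",
               SATM = "SATMath", percent = "Percent",
               dollars = "StateSpending", pay = "Salary")

# Rename by matching on the old names, so the column order does not matter.
names(demo) <- unname(new_names[names(demo)])
names(demo)
```

Because the lookup is keyed by the old names, reordering the columns beforehand would not mislabel any variable.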
mydata$FullSATScore <- (mydata$SATVerbal + mydata$SATMath) #Creating new variable that shows the average full SAT score (SAT Verbal + SAT Math) of each state
head(mydata)
## Region Population SATVerbal SATMath Percent StateSpending Salary
## AL ESC 4041 470 514 8 3.648 27
## AK PAC 550 438 476 42 7.887 43
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 3.334 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
## FullSATScore
## AL 984
## AK 914
## AZ 942
## AR 981
## CA 903
## CO 969
summary(mydata) #Interpretation of descriptive statistics
## Region Population SATVerbal SATMath
## SA : 9 Min. : 454 Min. :397.0 Min. :437.0
## MTN : 8 1st Qu.: 1215 1st Qu.:422.5 1st Qu.:470.0
## WNC : 7 Median : 3294 Median :443.0 Median :490.0
## NE : 6 Mean : 4877 Mean :448.2 Mean :497.4
## ENC : 5 3rd Qu.: 5780 3rd Qu.:474.5 3rd Qu.:522.5
## PAC : 5 Max. :29760 Max. :511.0 Max. :577.0
## (Other):11
## Percent StateSpending Salary FullSATScore
## Min. : 4.00 Min. :2.993 Min. :22.00 Min. : 834.0
## 1st Qu.:11.50 1st Qu.:4.354 1st Qu.:27.50 1st Qu.: 893.0
## Median :25.00 Median :5.045 Median :30.00 Median : 933.0
## Mean :33.75 Mean :5.175 Mean :30.94 Mean : 945.5
## 3rd Qu.:57.50 3rd Qu.:5.689 3rd Qu.:33.50 3rd Qu.: 994.5
## Max. :74.00 Max. :9.159 Max. :43.00 Max. :1088.0
##
library(psych)
describe(mydata) #Interpretation of descriptive statistics
## vars n mean sd median trimmed mad min
## Region* 1 51 5.27 2.45 5.00 5.37 2.97 1.00
## Population 2 51 4876.65 5439.20 3294.00 3813.15 3239.48 454.00
## SATVerbal 3 51 448.16 30.82 443.00 447.29 37.06 397.00
## SATMath 4 51 497.39 34.57 490.00 496.51 40.03 437.00
## Percent 5 51 33.75 24.07 25.00 32.76 28.17 4.00
## StateSpending 6 51 5.18 1.38 5.04 5.02 1.03 2.99
## Salary 7 51 30.94 5.31 30.00 30.63 4.45 22.00
## FullSATScore 8 51 945.55 64.77 933.00 943.85 74.13 834.00
## max range skew kurtosis se
## Region* 9.00 8.00 -0.25 -1.11 0.34
## Population 29760.00 29306.00 2.41 7.06 761.64
## SATVerbal 511.00 114.00 0.18 -1.11 4.32
## SATMath 577.00 140.00 0.23 -0.86 4.84
## Percent 74.00 70.00 0.22 -1.63 3.37
## StateSpending 9.16 6.17 0.97 0.71 0.19
## Salary 43.00 21.00 0.51 -0.49 0.74
## FullSATScore 1088.00 254.00 0.22 -1.00 9.07
Mean for StateSpending: The mean value for StateSpending is approximately 5.18. This means that, on average, states spend around $5,180 per student on public education.
Median for FullSATScore: The median value for FullSATScore is approximately 933.00. It suggests that half of the states have full SAT Scores (SAT Verbal + SAT Math) below this figure, while the other half have full SAT Scores above this figure.
Max. for Percent: The maximum percent of graduating high-school students per state who took the SAT exam is 74.
IQR for Population: 5780 (Q3) - 1215 (Q1) = 4565 (IQR). The middle 50% of the states have populations between 1,215,000 and 5,780,000 people, a spread of 4,565,000 people.
Min. for Salary: The minimum average teacher’s salary among the states is $22,000.
1st Qu. for SATVerbal: Q1 of 422.5 indicates that 25% of the states in the data set had an average SAT verbal score of 422.5 or below, while 75% had higher average scores.
3rd Qu. for Population: Q3 of 5780 indicates that 75% of the states have populations of 5,780,000 or below, while 25% have populations above this level.
Skewness of Population: A skewness value of 2.41 indicates that the distribution of the “Population” variable is positively skewed, asymmetrical to the right.
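The arithmetic behind the IQR reading, and the idea behind the skewness value, can be sketched in base R. The quartiles are copied from the summary() output. The function sample_skewness is a hypothetical helper that uses one common definition of sample skewness; psych::describe may apply a slightly different small-sample adjustment, so this is an illustration rather than a reproduction of the 2.41 reported above.

```r
# Quartiles of Population (in thousands), copied from the summary() output.
q1 <- 1215
q3 <- 5780

iqr_thousands <- q3 - q1               # width of the middle 50%, in thousands
iqr_people    <- iqr_thousands * 1000  # the same width in people

# Hypothetical helper: mean cubed deviation divided by the cubed sample SD.
# Assumption: one common definition; other definitions differ by small-sample factors.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Population values of the six states shown by head(); the single large
# value (CA) pulls the right tail out, so the result is positive.
pop_head <- c(4041, 550, 3665, 2351, 29760, 3294)
sample_skewness(pop_head)
```

A positive result means the right tail is longer, which matches the reading of Population as positively skewed.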
mydata$FullSATScoreFactor <- factor(ifelse(mydata$FullSATScore < 933, 0, 1),
                                    levels = c(0, 1),
                                    labels = c("Below Median", "Above Median")) #Creating a factor that divides the U.S. states by full average SAT score; states exactly at the median (933) fall in "Above Median"
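To see how the split behaves, here is a small sketch that applies the same ifelse()/factor() logic to the six FullSATScore values shown by head(), plus one hypothetical state (labelled XX, not in the data) that sits exactly on the cut-off.

```r
# FullSATScore of the six head() states, plus a hypothetical state exactly at 933.
scores <- c(AL = 984, AK = 914, AZ = 942, AR = 981, CA = 903, CO = 969, XX = 933)

groups <- factor(ifelse(scores < 933, 0, 1),
                 levels = c(0, 1),
                 labels = c("Below Median", "Above Median"))

# The comparison is strict (<), so a score equal to the cut-off
# is assigned to "Above Median".
data.frame(score = scores, group = groups)
table(groups)
```

This boundary rule is why the two groups in the full data are not the same size (25 and 26 states).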
library(psych)
describeBy(mydata$StateSpending, g = mydata$FullSATScoreFactor)
##
## Descriptive statistics by group
## group: Below Median
## vars n mean sd median trimmed mad min max range skew
## X1 1 25 5.98 1.39 5.5 5.87 1.01 4.24 9.16 4.92 0.8
## kurtosis se
## X1 -0.6 0.28
## ----------------------------------------------------
## group: Above Median
## vars n mean sd median trimmed mad min max range skew
## X1 1 26 4.41 0.82 4.4 4.4 1.01 2.99 5.95 2.95 -0.01
## kurtosis se
## X1 -1.23 0.16
Group 1: Below median
Mean: The mean value for StateSpending in the Below Median group is approximately 5.98. This means that, on average, this group of states spends around $5,980 per student on public education.
Median: The median value for StateSpending in the Below Median group is approximately 5.5. It suggests that half of the states in this group spend less than this figure ($5,500) per student, while the other half spend more.
Skewness: A skewness value of 0.8 indicates that the distribution of the StateSpending variable in the Below Median group of States is positively skewed, asymmetrical to the right.
Group 2: Above median
Mean: The mean value for StateSpending in the Above Median group is approximately 4.41. This means that, on average, this group of states spends around $4,410 per student on public education.
Median: The median value for StateSpending in the Above Median group is approximately 4.4. It suggests that half of the states in this group spend less than this figure ($4,400) per student, while the other half spend more.
Skewness: A skewness value of -0.01 indicates that the distribution of the StateSpending variable in the Above Median group of States is approximately symmetric.
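The group-wise reading above (a value in thousands, converted to dollars per student) can be sketched with base R's tapply(). The sketch uses only the six states shown by head(), grouped by the 933 cut-off, so its numbers differ from the full-sample output of describeBy().

```r
# StateSpending (thousands of dollars per student) for the six head() states.
spending <- c(AL = 3.648, AK = 7.887, AZ = 4.231, AR = 3.334, CA = 4.826, CO = 4.809)

# Group membership follows from FullSATScore and the 933 cut-off.
group <- factor(c("Above Median", "Below Median", "Above Median",
                  "Above Median", "Below Median", "Above Median"),
                levels = c("Below Median", "Above Median"))

# Group means and medians in thousands of dollars ...
means_thousands   <- tapply(spending, group, mean)
medians_thousands <- tapply(spending, group, median)

# ... and converted to dollars per student for interpretation.
round(means_thousands * 1000)
round(medians_thousands * 1000)
```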
The independent-samples t-test assumes that (1) the dependent variable is numerical and (2) it is normally distributed in each group. Since StateSpending is a numerical variable, the first assumption is not violated. Next, we check the second assumption, normality of the distribution in each group, with a histogram and the Shapiro-Wilk test.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = StateSpending)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  facet_wrap(~FullSATScoreFactor, ncol = 1) +
  ylab("Frequency") #Creating histogram to show the distribution
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.3.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
  group_by(FullSATScoreFactor) %>%
  shapiro_test(StateSpending) #Shapiro-Wilk test to check normality of the data
## # A tibble: 2 × 4
## FullSATScoreFactor variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Below Median StateSpending 0.898 0.0165
## 2 Above Median StateSpending 0.959 0.378
Below Median
H0: Variable is normally distributed
H1: Variable is not normally distributed
We reject the null hypothesis (p = 0.017) and conclude that the variable StateSpending is not normally distributed among the states with full SAT scores below the median.
Above Median
H0: Variable is normally distributed
H1: Variable is not normally distributed
We fail to reject the null hypothesis (p = 0.38), so we can assume that the variable StateSpending is normally distributed among the states with full SAT scores above the median.
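The decision rule used for both groups can be sketched as follows. The p-values are copied from the shapiro_test() output. The significance level alpha = 0.05 is an assumption, since the text does not state it explicitly. The vector x is a hypothetical simulated sample, included only to show the base-R call.

```r
# Assumed significance level (not stated explicitly in the analysis).
alpha <- 0.05

# p-values copied from the shapiro_test() output.
p_values <- c("Below Median" = 0.0165, "Above Median" = 0.378)

# Reject H0 (normality) when p < alpha; otherwise fail to reject it.
decision <- ifelse(p_values < alpha, "reject H0", "fail to reject H0")
decision

# The same test in base R, on a hypothetical right-skewed sample
# (simulated data, not the States data).
set.seed(1)
x <- rexp(25)
shapiro.test(x)$p.value
```

Failing to reject H0 is not proof of normality; it only means the sample gives no evidence against it.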
Since the second assumption is violated in one of the groups, a nonparametric test is required. However, I first run the parametric test, without interpreting it, in order to show the steps.
t.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
       paired = FALSE,
       var.equal = FALSE,
       alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata$StateSpending by mydata$FullSATScoreFactor
## t = 4.8883, df = 38.449, p-value = 1.835e-05
## alternative hypothesis: true difference in means between group Below Median and group Above Median is not equal to 0
## 95 percent confidence interval:
## 0.9205684 2.2211485
## sample estimates:
## mean in group Below Median mean in group Above Median
## 5.976320 4.405462
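As an illustration of what the Welch test computes, the statistic and the degrees of freedom can be approximated from the group summaries alone. The means come from the t.test() output; the standard deviations and group sizes come from describeBy(), where the SDs are rounded to two decimals, so the result only approximately matches t = 4.8883 and df = 38.449.

```r
# Group summaries: means from t.test(), SDs and n from describeBy() (SDs rounded).
m1 <- 5.976320; s1 <- 1.39; n1 <- 25   # Below Median
m2 <- 4.405462; s2 <- 0.82; n2 <- 26   # Above Median

# Squared standard error of each group mean.
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic: mean difference over the unpooled standard error.
t_stat <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom.
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

c(t = t_stat, df = df, p = p_value)
```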
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata$StateSpending ~ mydata$FullSATScoreFactor,
                     pooled_sd = TRUE) #Cohen's d using the pooled SD, as reported in the output below
## Cohen's d | 95% CI
## ------------------------
## 1.38 | [0.76, 1.99]
##
## - Estimated using pooled SD.
interpret_cohens_d(1.38, rules = "sawilowsky2009")
## [1] "very large"
## (Rules: sawilowsky2009)
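The reported effect size can be checked by hand with the pooled-SD formula for Cohen's d. This sketch uses the group means from the t.test() output and the rounded SDs and group sizes from describeBy(), so it reproduces the value of 1.38 only up to rounding.

```r
# Group summaries: means from t.test(), SDs and n from describeBy() (SDs rounded).
m1 <- 5.976320; s1 <- 1.39; n1 <- 25   # Below Median
m2 <- 4.405462; s2 <- 0.82; n2 <- 26   # Above Median

# Pooled standard deviation: variances weighted by their degrees of freedom.
sd_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Cohen's d: mean difference expressed in pooled-SD units.
d <- (m1 - m2) / sd_pooled
round(d, 2)
```

A d of this size means the two group means lie well over one pooled standard deviation apart.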
wilcox.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$StateSpending by mydata$FullSATScoreFactor
## W = 541, p-value = 4.703e-05
## alternative hypothesis: true location shift is not equal to 0
H0: Distribution location of state spending is the same for the U.S. States with average SAT scores above the median and the ones below the median.
H1: Distribution location of state spending is different for the U.S. States with average SAT scores above the median and the ones below the median.
We reject the null hypothesis (p < 0.001) and conclude that the distribution location of state spending differs between the U.S. states with full SAT scores above the median and those below the median.
library(effectsize)
effectsize(wilcox.test(mydata$StateSpending ~ mydata$FullSATScoreFactor,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.66 | [0.45, 0.81]
interpret_rank_biserial(0.66)
## [1] "very large"
## (Rules: funder2019)
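The rank-biserial correlation can be related to the W statistic with a common textbook formula, r = 2W / (n1 * n2) - 1. This is a sketch under the assumption that this formula corresponds to what effectsize reports for two independent samples. W is copied from the wilcox.test() output and the group sizes from describeBy().

```r
# W from wilcox.test(); group sizes from describeBy().
W  <- 541
n1 <- 25   # Below Median
n2 <- 26   # Above Median

# Share of all (Below, Above) state pairs in which the Below Median
# state has the higher spending (ties count as one half).
share_pairs <- W / (n1 * n2)

# Rank-biserial correlation: rescales that share from [0, 1] to [-1, 1].
r_rb <- 2 * share_pairs - 1

round(c(share = share_pairs, r = r_rb), 2)
```

Read this way, the effect size says that in roughly five out of six pairs the state from the Below Median group spends more per student.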
Based on the sample data, we find that U.S. states with average full SAT scores (SAT Verbal + SAT Math) above the median differ in public education spending per student from those with average full SAT scores below the median (p < 0.001). States with full SAT scores below the median spend more, and the difference in distribution is very large (r = 0.66). However, this could be attributed to the fact that the states with full SAT scores above the median have lower percentages of graduating high-school students who took the SAT exam than the states with full SAT scores below the median.